Abstract

This technical report examines the fundamental ambiguities and uncertainties inherent in recov- 
ering structure from motion. By examining the eigenvectors associated with null or small eigen- 
values of the Hessian matrix, we can quantify the exact nature of these ambiguities and predict how 
they affect the accuracy of the reconstructed shape. Our results for orthographic cameras show that 
the bas-relief ambiguity is significant even with many images, unless a large amount of rotation 
is present. Similar results for perspective cameras suggest that three or more frames and a large 
amount of rotation are required for metrically accurate reconstruction.

1 Introduction

Structure from motion is one of the classic problems in computer vision and has received a great 
deal of attention over the last decade. It has wide-ranging applications, including robot vehicle 
guidance and obstacle avoidance, and the reconstruction of 3-D models from imagery. Unfortu- 
nately, the quality of results available using this approach is still often very disappointing. More 
precisely, while the qualitative estimates of structure and motion look reasonable, the actual quan- 
titative (metric) estimates can be significantly distorted. 

Much progress has been made recently in identifying the sources of errors and instabilities in 
the structure from motion process. It is now widely understood that the arbitrary algebraic manip- 
ulation of the imaging equations to derive closed-form solutions (e.g., [LH81]) can lead to algo- 
rithms that are numerically ill-conditioned or unstable in the presence of measurement errors. To 
overcome this, statistically optimal algorithms for estimating structure and motion have been devel- 
oped [SA89; WAH89; Hor90; TK92b; SK94]. It is also understood that using more feature points 
and images results in better estimates, and that certain configurations of points (at least in the two 
frame case) are pathological and cannot be reconstructed. 

An example of an algorithm which generates very good results is the factorization approach of 
Tomasi and Kanade [TK92b]. This algorithm assumes orthography and is implemented using an 
object-centered representation and singular value decomposition. It uses many points and frames, 
and for most sequences, a large amount of object rotation (usually 360°). However, when only a 
small range of viewpoints is present (e.g., the "House" sequence in [TK92b], Figure 7), the recon- 
struction no longer appears metric (the house walls are not perpendicular). 

In this technical report, we demonstrate that it is precisely this last factor, i.e., the overall ro- 
tation of the object, or equivalently, the variation in viewpoints, which critically determines the 
quality of the reconstruction. The ambiguity in object shape due to small viewpoint variation of- 
ten looks like it might be a projective deformation of the Euclidean shape, which is interesting — 
several researchers have argued recently in favor of trying to recover only this projective structure 
[Fau92; HGC92; MQVB92; Sha93]. In fact, we show that the major ambiguity in the reconstruction
is a simple depth scale uncertainty, i.e., the classic bas-relief ambiguity which exists for two-frame 
structure from motion under orthographic projection [LH86]. 1 

1 The bas-relief ambiguity is even more pronounced in shape from shading, and forms the basis of classical friezes 
and bas-relief sculptures.

To derive our results, we use eigenvalue analysis of the covariance matrix for the structure and 
motion estimates. This assumes that we can compute a near optimal solution, and that the error in 
the solution is due to linear perturbations arising from small amounts of image noise (feature point 
mislocalization). This kind of analysis has not previously been applied to structure from motion, 
and yet it is a very powerful way to predict the ultimate performance of structure from motion al- 
gorithms. 

Our results are significant for two reasons. First, we show how to theoretically derive the ex- 
pected ambiguity in a reconstruction, and also derive some intuitive guidelines for selecting imag- 
ing situations which can be expected to produce reasonable results. Second, since the primary am- 
biguities are very well characterized by a small number of modes, this information can be used to 
construct better on-line (recursive) estimation algorithms. 

Our technical report is structured as follows. After reviewing previous work, we present our 
formulation of the structure from motion problem and develop our technique for analyzing ambi- 
guities using eigenvector analysis of the information (Hessian) matrix. We then present the results 
of our analysis for a series of camera models: 1-D and 2-D orthographic cameras, and 1-D and 2-D 
perspective cameras. We conclude with a discussion of the main sources of errors and ambiguities, 
and directions for possible future work. 

2 Previous work 

Structure from motion has been extensively studied in computer vision. Early papers on this sub- 
ject [LH81; TH84] develop algorithms to compute the structure and motion from a small set of 
points matched in two frames using an essential parameter approach. The performance of this ap- 
proach can be significantly improved using non-linear least squares (optimal estimation) techniques 
[WAH89; WAH93; SA89; Hor90; SA91]. 

Recent research focuses on extraction of shape and motion from longer image sequences [KTJ89;
DA90; CWC90; TK92b; CT92]. Cui, Weng, and Cohen [CWC90] use an optimal estimation technique
(non-linear least squares) between each pair of frames, and an extended Kalman filter to accumulate
information over time (see also [THO93; SPFP93]). Azarbayejani et al. [AHP93] also use a
Kalman filter-based approach to recover rigid (object-centered) depth and motion directly from the
sequence of image measurements. Tomasi and Kanade [TK92b] use a factorization method which
extracts shape and motion from an image stream without computing camera-centered depth. Their
approach formulates the shape from motion problem in object-centered coordinates, assumes
orthography, and processes all of the frames simultaneously. Chen and Tsuji [CT92] relax the
assumption of orthography by analyzing the image sequence through its temporal and spatial subparts.
Taylor and Kriegman [TKA91; TK92a] formulate the shape from motion task as a non-linear least
squares problem in which the Euclidean distance between the estimated and actual positions of the
points in the image sequence is minimized using the Levenberg-Marquardt algorithm. Szeliski and
Kang [SK94] extend this approach to general 3-D structure and also to projective structure
and motion recovery.

Another line of research has addressed recovering affine [KvD91; SZB93] or projective [Fau92; 
HGC92; HG93; MVQ93] structure estimates. Most of these techniques rely on identifying and 
tracking a small number of feature points in the image sequence, using these points to form a basis 
set for the geometric description, and also only use 2 frames to recover the geometry. However, 
Mohr et al. [MVQ93] and Szeliski and Kang [SK94] use as many points and frames as possible to 
recover the geometry and motion, thus producing more reliable estimates. 

The nature of structure and motion errors, which is the main focus of this technical report, has 
also previously been studied. Weng et al. perform some of the earliest and most detailed error anal- 
yses of the two-frame essential parameter approach [WAH89; WAH93]. Adiv [Adi89] and Young 
and Chellappa [YC92] analyze continuous-time (optical flow) based algorithms using the concept 
of the Cramer-Rao lower bound. Oliensis and Thomas [OT91; THO93] show how modeling the
motion error can significantly improve the performance of recursive algorithms. 

In this technical report, we extend these previous results using an eigenvalue analysis of the 
covariance matrix. This analysis can pinpoint the exact nature of structure from motion ambiguities 
and the largest sources of reconstruction error. We also focus on multi-frame optimal structure from 
motion algorithms, which have not been studied in great detail. 

3 Problem formulation and uncertainty analysis 

Structure from motion can be formulated as the recovery of a set of 3-D structure parameters $p_i$
and time-varying motion parameters $m_j$ from a set of observed image features $u_{ij}$. In this section,
we present the forward equations, i.e., the rigid body and perspective transformations which map
3-D points into 2-D image points. We also show how the Jacobians of the forward equations can
be used to estimate the inverse covariance matrix for the parameters being recovered, how this can
be used to quantify expected reconstruction errors, and how our results relate to classical structure
from motion ambiguities.



3.1 Problem formulation 

The equation which projects the $i$th 3-D point $p_i$ into the $j$th frame at location $u_{ij}$ is

$$
u_{ij} = \mathcal{P}(\mathcal{T}(p_i, m_j)).
\tag{1}
$$

The perspective projection $\mathcal{P}$ (defined below) is applied to a rigid transformation

$$
\mathcal{T}(p_i, m_j) = R_j p_i + t_j,
\tag{2}
$$

where $R_j$ is a rotation matrix and $t_j$ is a translation applied after the rotation. A variety of alternative
representations are possible for the rotation matrix [Aya91]. In this technical report, we primarily
use a quaternion $q = [w, (q_0, q_1, q_2)]$ representation, with a corresponding rotation matrix

$$
R(q) = \begin{pmatrix}
1 - 2q_1^2 - 2q_2^2 & 2q_0 q_1 + 2w q_2 & 2q_0 q_2 - 2w q_1 \\
2q_0 q_1 - 2w q_2 & 1 - 2q_0^2 - 2q_2^2 & 2q_1 q_2 + 2w q_0 \\
2q_0 q_2 + 2w q_1 & 2q_1 q_2 - 2w q_0 & 1 - 2q_0^2 - 2q_1^2
\end{pmatrix},
\tag{3}
$$

since this representation has no singularities. The rotation parameters $q_0$, $q_1$, $q_2$ also have a natural
interpretation (for small values) as the half-angles of rotation around the $x$, $y$, and $z$ axes. For our
one-dimensional examples, we use the rotation angle around the vertical axis.
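
The quaternion parameterization can be checked numerically. The following is a minimal sketch, assuming the entry layout of the rotation matrix in (3) and a unit quaternion; the helper names are hypothetical and are not taken from the report. It builds $R(q)$ for a small rotation about the $z$ axis and confirms that the result is orthonormal and that the vector part behaves like a half-angle.

```python
import math

def rotation_from_quaternion(w, q0, q1, q2):
    """Rotation matrix with the entry layout of eq. (3); expects a unit quaternion."""
    return [
        [1 - 2*q1*q1 - 2*q2*q2, 2*q0*q1 + 2*w*q2,      2*q0*q2 - 2*w*q1],
        [2*q0*q1 - 2*w*q2,      1 - 2*q0*q0 - 2*q2*q2, 2*q1*q2 + 2*w*q0],
        [2*q0*q2 + 2*w*q1,      2*q1*q2 - 2*w*q0,      1 - 2*q0*q0 - 2*q1*q1],
    ]

def small_rotation_quaternion(half_x, half_y, half_z):
    """Unit quaternion whose vector part holds the (small) half-angles."""
    w = math.sqrt(1.0 - half_x**2 - half_y**2 - half_z**2)
    return w, half_x, half_y, half_z

def times_own_transpose(R):
    """Return R * R^T, which is the identity for any rotation matrix."""
    return [[sum(R[i][k] * R[j][k] for k in range(3)) for j in range(3)]
            for i in range(3)]

# Half-angle 0.05 about z, i.e. a total rotation of roughly 0.1 rad.
R = rotation_from_quaternion(*small_rotation_quaternion(0.0, 0.0, 0.05))
RRt = times_own_transpose(R)
```

The off-diagonal entry `R[0][1]` comes out close to 0.1, the sine of the full rotation angle, which is the sense in which $q_2$ acts as a half-angle for small rotations.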

The standard perspective projection equation used in computer vision is

$$
\begin{pmatrix} u \\ v \end{pmatrix} = \begin{pmatrix} f x / z \\ f y / z \end{pmatrix},
\tag{4}
$$

where $f$ is a product of the focal length of the camera and the pixel scale factor (assuming that pixels
are square). An alternative object-centered formulation, which we introduced in [SK94], is

$$
\begin{pmatrix} u \\ v \end{pmatrix} = \begin{pmatrix} \dfrac{s x}{1 + \eta z} \\[2mm] \dfrac{s y}{1 + \eta z} \end{pmatrix}.
\tag{5}
$$

Figure 1: Sample configuration of cameras ($m_j$), 3-D points ($p_i$), image planes ($\Pi_j$), and screen
locations ($u_{ij}$).

Here, we assume that the $(x, y, z)$ coordinates before projection are with respect to a reference frame
$\Pi_j$ that has been displaced away from the camera by a distance $t_z$ along the optical axis, with
$s = f/t_z$ and $\eta = 1/t_z$ (Figure 1). The projection parameter $s$ can be interpreted as a scale factor and $\eta$
as a perspective distortion factor. Our alternative perspective formulation allows us to model both
orthographic and perspective cameras using the same model.
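
To make the relation between the two camera models concrete, here is a small sketch. It assumes the object-centered projection has the form $u = s x / (1 + \eta z)$, which is what $s = f/t_z$ and $\eta = 1/t_z$ give when the camera-centered depth is $z + t_z$; the function names and the numerical values are illustrative only.

```python
def project_camera_centered(x, y, z, f):
    """Standard perspective projection (4) in camera-centered coordinates (z > 0)."""
    return f * x / z, f * y / z

def project_object_centered(x, y, z, s, eta):
    """Object-centered projection, assumed form u = s x / (1 + eta z)."""
    w = 1.0 + eta * z
    return s * x / w, s * y / w

f, t_z = 500.0, 10.0        # hypothetical focal scale and reference-frame distance
x, y, z = 1.0, 2.0, 3.0     # a point expressed in the displaced reference frame

# Same point seen by both formulations: the image coordinates should agree.
u_obj = project_object_centered(x, y, z, s=f / t_z, eta=1.0 / t_z)
u_cam = project_camera_centered(x, y, z + t_z, f)

# Setting eta = 0 removes the depth dependence: scaled orthography.
u_orth = project_object_centered(x, y, z, s=f / t_z, eta=0.0)
```

With $\eta = 0$ the image coordinates no longer depend on $z$, which is how a single model can cover both the orthographic and the perspective case.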

A variety of techniques (reviewed in Section 2) can be used to estimate the unknowns $\{p_i, m_j\}$
from the given image measurements $\{u_{ij}\}$. In our previous work [SK94], we used the iterative
Levenberg-Marquardt algorithm, since it provides a statistically optimal solution [WAH89; SA89;
TK92a; SK94]. The Levenberg-Marquardt method is a standard non-linear least squares technique
[PFTV92] which directly minimizes a merit or objective function

$$
\mathcal{C}(a) = \sum_i \sum_j c_{ij} \, | \tilde{u}_{ij} - f_{ij}(a) |^2,
\tag{6}
$$

where $\tilde{u}_{ij}$ is the observed image measurement, $f_{ij}(a) = u_{ij}(p_i, m_j)$ is given in (1), and the vector
$a$ contains all of the unknown structure and motion parameters, including the 3-D points $p_i$, the
motion parameters $m_j$, and any additional unknown calibration parameters. The weight $c_{ij}$ in (6)
describes the confidence in measurement $u_{ij}$, and is normally set to the inverse variance (it can
be set to zero for missing measurements).



3.2 Uncertainty analysis 



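Regardless of the solution technique, the uncertainty in the recovered parameters (assuming that
image measurements are corrupted by small Gaussian noise errors) can be determined by computing
the inverse covariance or information matrix $A$ [Sor80]. This matrix is formed by computing
outer products of the Jacobians of the measurement equations,

$$
A = \sum_i \sum_j c_{ij} \frac{\partial f_{ij}^T}{\partial a} \frac{\partial f_{ij}}{\partial a^T}.
\tag{7}
$$

For notational succinctness, we use the symbol

$$
H_{ij} = \begin{bmatrix} \dfrac{\partial f_{ij}}{\partial p_i^T} & \dfrac{\partial f_{ij}}{\partial m_j^T} \end{bmatrix}
\tag{8}
$$

to denote the non-zero portion of the full Jacobian $\partial f_{ij} / \partial a^T$.

If we list the structure parameters $\{p_i\}$ first, followed by the motion parameters $\{m_j\}$, the $A$
matrix has the structure

$$
A = \begin{bmatrix} A_p & A_{pm} \\ A_{pm}^T & A_m \end{bmatrix}.
\tag{9}
$$

The matrices $A_p$ and $A_m$ are block diagonal, with diagonal entries

$$
A_{p_i} = \sum_j \frac{\partial f_{ij}^T}{\partial p_i} \frac{\partial f_{ij}}{\partial p_i^T}
\quad \text{and} \quad
A_{m_j} = \sum_i \frac{\partial f_{ij}^T}{\partial m_j} \frac{\partial f_{ij}}{\partial m_j^T},
$$

respectively (assuming $c_{ij} = 1$), while $A_{pm}$ is dense, with entries

$$
A_{p_i m_j} = \frac{\partial f_{ij}^T}{\partial p_i} \frac{\partial f_{ij}}{\partial m_j^T}.
\tag{10}
$$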

The information matrix has previously been used in the context of structure from motion to
determine Cramer-Rao lower bounds on the parameter uncertainties by taking the inverse of the
diagonal entries [Adi89; YC92]. The Cramer-Rao bounds, however, can be arbitrarily weak, especially
when $A$ is singular or near-singular. In this technical report, we use eigenvector analysis of $A$ to
find the dominant directions in the uncertainty (covariance) matrix and their magnitudes, which
gives us more insight into the exact nature of structure from motion ambiguities.

3.3 Estimating reconstruction errors

An important benefit of uncertainty analysis is that we can easily quantify the expected amount of
reconstruction (and motion) error for an optimal structure from motion algorithm. For example, the
expected sum of squared error in reconstructed 3-D point positions is

$$
S_{pos}^2 = \left\langle \sum_i \| \hat{p}_i - p_i^* \|^2 \right\rangle,
\tag{11}
$$

where $\hat{p}_i$ are the estimated (recovered) positions and $p_i^*$ the true positions. The positional
uncertainty matrix $C_p$ can be computed by inverting $A$ and looking at its upper left block (the block
corresponding to the $p_i$ variables).² If we perform an eigenvalue analysis of $C_p$, we obtain

$$
C_p = E_p \Lambda_p E_p^T,
\tag{12}
$$

where $E_p$ is the matrix of eigenvectors, and $\Lambda_p$ is the diagonal matrix containing the eigenvalues
of $C_p$. Since $S_{pos}^2$ is a squared Euclidean norm, its value is unaffected by orthogonal coordinate
transformations such as $E_p$. The value of $S_{pos}^2$ can thus be computed as either the trace of $C_p$ or the trace
of $\Lambda_p$, i.e., the sum of the eigenvalues of $C_p$.

In practice, we do not need to compute $C_p$. Instead, the sum of squared reconstruction and
motion error,

$$
S_{all}^2 = \left\langle \sum_i \| \hat{p}_i - p_i^* \|^2 + \sum_j \| \hat{m}_j - m_j^* \|^2 \right\rangle,
\tag{13}
$$

can be computed directly by summing the inverse eigenvalues of the information matrix $A$. By choosing
an appropriate scaling for the parameters being estimated (say scaling positions to be in the
range $[-100 \ldots 100]$ and rotations in the range $[-\pi \ldots \pi]$), we can make the mean of $S_{all}$ be close
to the mean of $S_{pos}$. Note that for general 3-D camera motion, positional errors in the motion
estimates will be on the same scale as 3-D reconstruction errors, and may sometimes dominate (if the
absolute distance of the camera is ill determined).
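
The following sketch illustrates two of the points above on a hypothetical, strongly correlated 2 x 2 information matrix (the numbers are chosen for illustration and do not come from the report). It shows that the inverse of a diagonal entry of $A$ can badly underestimate the true variance, and that the sum of the inverse eigenvalues of $A$ equals the trace of $A^{-1}$.

```python
import math

def eigenvalues_sym_2x2(a, b, d):
    """Eigenvalues (small, large) of the symmetric matrix [[a, b], [b, d]]."""
    mean = 0.5 * (a + d)
    radius = math.hypot(0.5 * (a - d), b)
    lam_max = mean + radius
    lam_min = (a * d - b * b) / lam_max   # det / lam_max avoids cancellation
    return lam_min, lam_max

# Hypothetical information matrix with two nearly dependent parameters.
a, b, d = 1.0, 0.99, 1.0
det = a * d - b * b

naive_variance = 1.0 / a    # inverse of the diagonal entry of A
true_variance = d / det     # corresponding diagonal entry of the inverse of A

lam_min, lam_max = eigenvalues_sym_2x2(a, b, d)
sum_inverse_eigenvalues = 1.0 / lam_min + 1.0 / lam_max
trace_of_inverse = (a + d) / det
```

In this example the true variance is about fifty times the diagonal-based figure, and almost all of the total comes from the single small eigenvalue.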

What is the advantage of this approach, if computing eigenvalues is just as expensive as inverting
matrices? First, we can compute the few smallest eigenvalues of $A$ more cheaply (and in less space) than
the matrix inverse, and their inverses tend to dominate the overall reconstruction error. Second, it justifies
the approach in this technical report, which is to look at the minimum eigenvalue as the prime
indicator of reconstruction error. We can therefore study how much certain ambiguities (such as the
bas-relief ambiguity) contribute to the overall reconstruction error. We can also obtain much tighter
lower bounds on the reconstruction error than would be possible by using the Cramer-Rao bounds.

² Note that this is not the same as simply inverting $A_p$.

3.4 Ambiguities in structure from motion 

Because structure from motion attempts to recover both the structure of the world and the camera 
motion without any external (prior) knowledge, it is subject to certain ambiguities. The most fun- 
damental (but most innocuous) of these is the coordinate frame (also known as pose, or Euclidean) 
ambiguity, i.e., we can move the origin of the coordinate system to an arbitrary place and pose and 
still obtain an equally valid solution. 

The next most common ambiguity is the scale ambiguity (for a perspective camera) or the depth 
ambiguity (for an orthographic camera). This ambiguity can be removed with a small amount of 
additional knowledge, e.g., the absolute distance between camera positions. 

A third ambiguity, and the one we focus on in this technical report, is the bas-relief ambiguity. 
In its pure form, this ambiguity occurs for a two frame problem with an orthographic camera, and 
is a confusion between the relative depth of the object and the amount of object rotation. In this 
technical report, we focus on the weak form of this ambiguity, i.e., the very large bas-relief uncer- 
tainty which occurs with imperfect measurements even when we use more than two frames and/or 
perspective cameras. A central result of this technical report is that the bas-relief ambiguity cap- 
tures the largest uncertainties arising in structure from motion. However, when examined in detail, 
it appears that a larger class of deformations (i.e., projective) more fully characterizes the errors 
which occur in structure from motion. 

To characterize these ambiguities, we will use eigenvector analysis of the information matrix, 
as explained in Section 3.2. Absolute ambiguities will show up as zero eigenvalues (unless we add 
additional constraints or knowledge to remove them), whereas weak ambiguities will show up as 
small eigenvalues. 

4 A two parameter example 

To develop an intuitive understanding of the basic bas-relief ambiguity, we start with a simple
two-parameter example. Assume that we have an orthographic scanline camera which measures the $x$
component of 2-D points $(x, z)$. Furthermore, assume that we already know the shape up to a scale
factor in depth,

$$
p_i = (x_i, a z_i),
$$

and that the rotation angles are uniform,

$$
\theta_j = j \Delta\theta.
$$

The projection equation is then

$$
u_{ij} = c_j x_i - a s_j z_i,
\tag{14}
$$

with $c_j = \cos\theta_j$ and $s_j = \sin\theta_j$.

Figure 2: Constraint lines and energy surface for simple two-parameter example: (a) noise-free,
(b) noise = 0.01, (c) contour plot. The x-axis is the angle $\Delta\theta$ and the y-axis is the scale factor $a$.

What happens when we try to estimate the scale factor $a$ and the angle $\Delta\theta$ from a set of noisy
measurements? First, let's examine the very simplest case, which is a single point, say at
$(x, z) = (1, 1)$. Each new image gives us a constraint of the form

$$
c_j - a s_j = c_j^* - a^* s_j^* + n_j,
\tag{15}
$$

where $c_j^*$, $s_j^*$, and $a^*$ are the true values and $n_j$ is random noise. Figure 2a shows the two constraint
lines for $j = \pm 1$ assuming the noise-free case (with $a = 1$ and $\Delta\theta = 0.1$ rad). Figure 2b shows
the constraint lines for $n_{-1} = n_1 = 0.01$. As can be seen, the estimate for $(\Delta\theta, a)$ is very sensitive
to noise. This can also be seen in the contour plot of the energy surface (Figure 2c) which can be
computed by summing the squared residuals of the constraints in (15).
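
The sensitivity can be reproduced with a few lines of arithmetic. The sketch below assumes the single point $(x, z) = (1, 1)$ and reads the constraint (15) for $j = \pm 1$ as $c \mp a s = c^* \mp a^* s^* + n_{\pm 1}$, where $c$ and $s$ are the cosine and sine of $\Delta\theta$; the function name is hypothetical. Adding the two constraints isolates $\cos\Delta\theta$, and subtracting them isolates $a \sin\Delta\theta$.

```python
import math

def solve_two_constraints(dtheta_true, a_true, n_plus, n_minus):
    """Intersect the constraint lines (15) for j = +1 and j = -1.

    Returns (dtheta, a), or None when the two lines do not intersect.
    """
    c_true, s_true = math.cos(dtheta_true), math.sin(dtheta_true)
    # j = +1:  c - a s = c* - a* s* + n_plus
    # j = -1:  c + a s = c* + a* s* + n_minus   (because s_{-1} = -s_1)
    c = c_true + 0.5 * (n_plus + n_minus)                   # half the sum
    a_times_s = a_true * s_true + 0.5 * (n_minus - n_plus)  # half the difference
    if not -1.0 < c < 1.0:
        return None   # no angle has this cosine
    dtheta = math.acos(c)
    return dtheta, a_times_s / math.sin(dtheta)

noise_free = solve_two_constraints(0.1, 1.0, 0.0, 0.0)
pushed_up = solve_two_constraints(0.1, 1.0, 0.01, 0.01)
pushed_down = solve_two_constraints(0.1, 1.0, -0.01, -0.01)
```

Under this reading, noise of +0.01 in both frames leaves no exact intersection at all, while noise of -0.01 moves the estimate from $(0.1, 1)$ to roughly $(0.17, 0.58)$: a 1% measurement error changes the recovered depth scale by about 40%.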

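To characterize the shape of the error surface near its minimum, we compute the information
matrix $A$. The Jacobian for $(a, \Delta\theta)$ is straightforward,

$$
H_{ij} = \begin{bmatrix} \dfrac{\partial u_{ij}}{\partial a} \\[2mm] \dfrac{\partial u_{ij}}{\partial \Delta\theta} \end{bmatrix}
= \begin{bmatrix} -s_j z_i \\ -j (a c_j z_i + s_j x_i) \end{bmatrix}
\approx -j \begin{bmatrix} \Delta\theta \, z_i \\ a z_i + j \Delta\theta \, x_i \end{bmatrix},
\tag{16}
$$

if we assume small rotation angles, $|\theta_j| \ll 1$, so that $s_j \approx j \Delta\theta$ and $c_j \approx 1$. The inverse covariance
(information) matrix is then

$$
A = J_2 Z \begin{bmatrix} \Delta\theta^2 & a \Delta\theta \\ a \Delta\theta & a^2 + \Delta\theta^2 \dfrac{J_4 X}{J_2 Z} \end{bmatrix},
\tag{17}
$$

where $J_2 = \sum_j j^2$, $J_4 = \sum_j j^4$, $X = \sum_i x_i^2$, and $Z = \sum_i z_i^2$ (assuming that $\sum_j j = 0$). Assuming
that $\Delta\theta^2 \ll a^2$, we can compute (Appendix A) the approximate eigenvalues of $A$ as

$$
\lambda_{min} \approx \Delta\theta^4 J_4 X / a^2 \quad \text{and} \quad \lambda_{max} \approx J_2 Z a^2.
\tag{18}
$$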



The eigenvalues of the information matrix describe an "elliptic" approximation to the error surface
(and hence posterior probability distribution), which matches the true "banana shaped" surface
near the optimal solution but not far away from it. To determine if the additional nonlinearities in
the reconstruction process result in lower or higher overall uncertainties than those predicted by the
information matrix, we would have to resort to numerical simulations. In practice, we expect these
secondary effects to be much smaller than the large variations in eigenvalues which explain most of
the uncertainties (ambiguities) associated with structure from motion.
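
The approximate eigenvalues of the two-parameter problem can be compared against exact ones. This is a sketch under stated assumptions: it takes the small-angle information matrix to have entries $J_2 Z \Delta\theta^2$, $J_2 Z a \Delta\theta$, and $J_2 Z a^2 + J_4 X \Delta\theta^2$ (obtained by summing the outer products of the small-angle Jacobian over symmetric frame indices $j = -J, \ldots, J$), and the function names are hypothetical.

```python
import math

def two_parameter_information(a, dtheta, J, X, Z):
    """Small-angle information matrix entries (A00, A01, A11) for frames j = -J..J."""
    js = range(-J, J + 1)
    J2 = sum(j ** 2 for j in js)
    J4 = sum(j ** 4 for j in js)
    A00 = J2 * Z * dtheta ** 2
    A01 = J2 * Z * a * dtheta
    A11 = J2 * Z * a ** 2 + J4 * X * dtheta ** 2
    return (A00, A01, A11), J2, J4

def eigenvalues_sym_2x2(a00, a01, a11):
    """Exact eigenvalues (small, large) of [[a00, a01], [a01, a11]]."""
    mean = 0.5 * (a00 + a11)
    radius = math.hypot(0.5 * (a00 - a11), a01)
    lam_max = mean + radius
    lam_min = (a00 * a11 - a01 * a01) / lam_max
    return lam_min, lam_max

a, dtheta, J, X, Z = 1.0, 0.1, 1, 1.0, 1.0
(A00, A01, A11), J2, J4 = two_parameter_information(a, dtheta, J, X, Z)
exact_min, exact_max = eigenvalues_sym_2x2(A00, A01, A11)

approx_min = dtheta ** 4 * J4 * X / a ** 2   # smaller eigenvalue, as in (18)
approx_max = J2 * Z * a ** 2                 # larger eigenvalue, as in (18)
```

For these values the two eigenvalues differ by four orders of magnitude, which is the elongated valley visible in the contour plot of the energy surface.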



5 Orthography: single scanline 

Let us now turn to a true structure from motion problem where both the structure and motion are
unknown. For simplicity, we analyze the orthographic scanline camera first, where the unknowns
are the 2-D point positions $p_i = (x_i, z_i)$ and the rotation angles $\theta_j$.³ The imaging equations are

$$
u_{ij} = c_j x_i - s_j z_i,
\tag{19}
$$

with $c_j = \cos\theta_j$ and $s_j = \sin\theta_j$.

³ We do not estimate the horizontal translation since it can be determined from the motion of the centroid of the
image points [TK92b].

The Jacobian for the 1-D orthographic camera is

$$
H_{ij} = \begin{bmatrix} \dfrac{\partial u_{ij}}{\partial x_i} & \dfrac{\partial u_{ij}}{\partial z_i} & \dfrac{\partial u_{ij}}{\partial \theta_j} \end{bmatrix}
= \begin{bmatrix} c_j & -s_j & -(s_j x_i + c_j z_i) \end{bmatrix},
\tag{20}
$$

and the entries in the information matrix are

$$
A_{p_i} = \begin{bmatrix} \sum_j c_j^2 & -\sum_j c_j s_j \\ -\sum_j c_j s_j & \sum_j s_j^2 \end{bmatrix}
= \begin{bmatrix} C & -D \\ -D & S \end{bmatrix},
\tag{21}
$$

$$
A_{p_i m_j} = \begin{bmatrix} -c_j (s_j x_i + c_j z_i) \\ s_j (s_j x_i + c_j z_i) \end{bmatrix},
\tag{22}
$$

$$
A_{m_j} = \sum_i (s_j x_i + c_j z_i)^2 = c_j^2 Z + 2 c_j s_j W + s_j^2 X,
\tag{23}
$$

with $C = \sum_j c_j^2$, $D = \sum_j c_j s_j$, $S = \sum_j s_j^2$, $Z = \sum_i z_i^2$, $W = \sum_i x_i z_i$, and $X = \sum_i x_i^2$.

Before analyzing the complete information matrix, let us look at the two subblocks $A_p$ and $A_m$.
If we know the motion, the structure uncertainty is determined by $A_{p_i}$, and is simply the triangulation
error, i.e., $\sigma_x^2 \propto C^{-1}$ and $\sigma_z^2 \propto S^{-1}$ (note that for small rotations, $\sigma_x^2$ is generally much smaller
than $\sigma_z^2$). If we know the structure, the motion accuracy is determined by $A_{m_j}$ and is inversely
proportional to the variance in depth along the viewing direction $(s_j, c_j)$.

What about ambiguities in the solution? Under orthography, the traditional scale ambiguity does
not exist. However, translations along the optical axis cannot be estimated, and an overall pose
(coordinate frame) ambiguity still exists. Unless we add some additional constraints, we can always
rotate the coordinate system by a $\Delta\theta$ and add the same amount to the $\{\theta_j\}$. This manifests itself as
the null (zero eigenvalue) eigenvector

$$
e_0 = \begin{bmatrix} z_0 & -x_0 & \cdots & z_N & -x_N & 1 & \cdots & 1 \end{bmatrix}^T.
$$



5.1 Two frames: the bas-relief ambiguity 

Let us say we only have two frames, and we have fixed $\theta_0 = 0$, $c_0 = 1$, $s_0 = 0$, $\theta_1 = \theta$, $c_1 = c$,
$s_1 = s$ (Figure 3). Then

$$
A_{p_i} = \begin{bmatrix} 1 + c^2 & -cs \\ -cs & s^2 \end{bmatrix},
\tag{24}
$$

$$
A_{p_i m} = \begin{bmatrix} -c (s x_i + c z_i) \\ s (s x_i + c z_i) \end{bmatrix},
\tag{25}
$$

$$
A_m = \begin{bmatrix} c^2 Z + 2 c s W + s^2 X \end{bmatrix}.
\tag{26}
$$

Figure 3: Orthographic projection, two frames: (a) stationary first frame, (b) antisymmetric cameras.
The solid lines indicate the viewing rays, while the thin lines indicate the optical axes and image
planes. The diagonal dashed lines are the displaced viewing rays, while the ellipses indicate the
positional uncertainty in the reconstruction due to uncertainty in motion (indicated as $\delta\theta$).

The bas-relief ambiguity manifests itself as a null eigenvector

$$
e_0 = \begin{bmatrix} 0 & c z_0 + s x_0 & \cdots & 0 & c z_N + s x_N & -s \end{bmatrix}^T,
$$

as can be verified by inspection. This is as we expected, i.e., the primary uncertainty in the structure
is entirely in the depth ($z$) direction, and is a scale uncertainty (proportional to $z$). Note however
that this uncertainty is proportional to $cz + sx$ rather than $z$, as can be seen by inspecting Figure 3a.

An alternative parameterization of the two-frame problem is to set $\theta_0 = -\theta_1$ (Figure 3b), in
which case we have

$$
A_{p_i} = \begin{bmatrix} 2c^2 & 0 \\ 0 & 2s^2 \end{bmatrix},
\tag{27}
$$

$$
A_{p_i m} = \begin{bmatrix} -2 c s \, x_i \\ 2 c s \, z_i \end{bmatrix},
\tag{28}
$$

$$
A_m = \begin{bmatrix} 2c^2 Z + 2s^2 X \end{bmatrix}.
\tag{29}
$$

In this case, the null eigenvector is

$$
e_0 = \begin{bmatrix} s^2 x_0 & -c^2 z_0 & \cdots & s^2 x_N & -c^2 z_N & cs \end{bmatrix}^T.
\tag{30}
$$

This is also very illuminating. It shows that the primary effect of the bas-relief ambiguity is a
"squashing" of the $z$ values for a small increase in motion, with a much smaller "bulging" in the $x$
values (at least for small inter-frame rotations).⁴ This squashing and bulging is an affine deformation
of the true structure.
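
Both two-frame null eigenvectors can be checked numerically by confirming that they produce no first-order change in any image measurement. The sketch assumes the scanline Jacobian row $[c_j, -s_j, -(s_j x_i + c_j z_i)]$ that follows from $u_{ij} = c_j x_i - s_j z_i$, and it takes the motion entry of the first eigenvector to be $-s$, which is what this sign convention requires; the point set and the helper names are illustrative.

```python
import math

def jacobian_row(x, z, theta):
    """Derivatives of u = cos(theta) x - sin(theta) z w.r.t. (x, z, theta)."""
    c, s = math.cos(theta), math.sin(theta)
    return (c, -s, -(s * x + c * z))

def first_order_change(x, z, theta, dx, dz, dtheta):
    """Change in the measurement predicted by the linearized imaging equation."""
    h = jacobian_row(x, z, theta)
    return h[0] * dx + h[1] * dz + h[2] * dtheta

def residual_fixed_first_frame(points, theta):
    """Frame 0 fixed at angle 0, frame 1 at theta: dp_i = (0, c z_i + s x_i), dtheta_1 = -s."""
    c, s = math.cos(theta), math.sin(theta)
    worst = 0.0
    for x, z in points:
        dx, dz = 0.0, c * z + s * x
        worst = max(worst,
                    abs(first_order_change(x, z, 0.0, dx, dz, 0.0)),
                    abs(first_order_change(x, z, theta, dx, dz, -s)))
    return worst

def residual_antisymmetric(points, theta):
    """Frames at -theta and +theta: dp_i = (s^2 x_i, -c^2 z_i), dtheta = c s."""
    c, s = math.cos(theta), math.sin(theta)
    worst = 0.0
    for x, z in points:
        dx, dz = s * s * x, -c * c * z
        worst = max(worst,
                    abs(first_order_change(x, z, theta, dx, dz, c * s)),
                    abs(first_order_change(x, z, -theta, dx, dz, -c * s)))
    return worst

points = [(1.0, 1.0), (-0.5, 2.0), (2.0, -1.0)]   # hypothetical test points
r_fixed = residual_fixed_first_frame(points, 0.3)
r_anti = residual_antisymmetric(points, 0.3)
```

Both residuals vanish to machine precision, so the deformations are invisible to a two-frame orthographic scanline camera at first order.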



5.2 More than two frames, equi-angular motion constraint 



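To simplify the analysis, we assume for the moment that we know we have an equi-angular image
sequence, i.e., that the rotation angles are given by $\theta_j = j \Delta\theta$, $j \in \{-J, \ldots, J\}$, $J = (F - 1)/2$, where
$F$ is the total number of frames (imagine Figure 3b with more cameras). In this case, we have

$$
H_{ij} = \begin{bmatrix} c_j & -s_j & -j (c_j z_i + s_j x_i) \end{bmatrix},
\tag{31}
$$

$$
A_{p_i} = \begin{bmatrix} C & 0 \\ 0 & S \end{bmatrix},
\tag{32}
$$

$$
A_{p_i m} = \begin{bmatrix} -\sum_j j c_j s_j \, x_i \\ \sum_j j c_j s_j \, z_i \end{bmatrix}
= \begin{bmatrix} -E x_i \\ E z_i \end{bmatrix},
\tag{33}
$$

$$
A_m = \begin{bmatrix} \sum_j j^2 c_j^2 \, Z + \sum_j j^2 s_j^2 \, X \end{bmatrix}
= \begin{bmatrix} C' Z + S' X \end{bmatrix},
\tag{34}
$$

with $E = \sum_j j c_j s_j$, $C' = \sum_j j^2 c_j^2$, $S' = \sum_j j^2 s_j^2$, and $C$, $D$, $S$, $Z$, $W$, $X$ defined as in (22-23). In
this case, the smallest eigenvalue eigenvector has the form

$$
e_0 = \begin{bmatrix} \alpha x_0 & -\beta z_0 & \cdots & \alpha x_N & -\beta z_N & 1 \end{bmatrix}^T.
\tag{35}
$$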

This will be an eigenvector if we can satisfy the matrix equation $A e = \lambda e$, i.e.,

$$
\begin{bmatrix} A_p & A_{pm} \\ A_{pm}^T & A_m \end{bmatrix}
\begin{bmatrix} \alpha x_0 \\ -\beta z_0 \\ \vdots \\ \alpha x_N \\ -\beta z_N \\ 1 \end{bmatrix}
= \lambda
\begin{bmatrix} \alpha x_0 \\ -\beta z_0 \\ \vdots \\ \alpha x_N \\ -\beta z_N \\ 1 \end{bmatrix},
$$

which reduces to the following three equations:

$$
\begin{aligned}
\alpha C - E &= \lambda \alpha \\
\beta S - E &= \lambda \beta \\
(S' - \alpha E) X + (C' - \beta E) Z &= \lambda.
\end{aligned}
$$

Substituting $\alpha = E / (C - \lambda)$ and $\beta = E / (S - \lambda)$ into the third equation, we obtain a cubic in $\lambda$,

$$
(S - \lambda)(S'(C - \lambda) - E^2) X + (C - \lambda)(C'(S - \lambda) - E^2) Z - (S - \lambda)(C - \lambda) \lambda = 0,
\tag{36}
$$

which can be solved analytically using a package such as Mathematica [Wol91].

Assuming that the smallest eigenvalue is very small, we can use the approximation $\alpha \approx E / C$ to
obtain a quadratic in $\lambda$,

$$
(S - \lambda)(S' C - E^2) X + C (C'(S - \lambda) - E^2) Z - (S - \lambda) C \lambda = 0.
\tag{37}
$$

Furthermore, using the small angle approximations, $C \approx \sum_j 1 = J_0$, $S \approx \Delta\theta^2 J_2$, $E \approx \Delta\theta J_2$,
$C' \approx J_2$, and $S' \approx \Delta\theta^2 J_4$, we obtain after some manipulation (Appendix A)

$$
\lambda_{min} \approx \frac{\Delta\theta^4 X J_2 (J_0 J_4 - J_2^2)}{J_0 J_2 Z + \Delta\theta^2 [X (J_0 J_4 - J_2^2) + J_0 J_2]}.
\tag{38}
$$

Notice that the minimum eigenvalue is related to the fourth power of $\Delta\theta$, i.e., doubling the inter-frame
rotation reduces the RMS (root mean square) error by a factor of 4 (assuming that $Z \gg \Delta\theta^2$).
Increasing the extent of the $x_i$ compared to the $z_i$ directly increases the minimum eigenvalue, i.e.,
it decreases the structure uncertainty. This result is somewhat surprising, and suggests that flatter
objects can be reconstructed better.

⁴ Note that compared to the previous example where frame 0 was fixed, the total interframe rotation is now $2\theta$.

We can numerically compute the values of $\lambda$ for a range of $J$ and $\Delta\theta$ values (Figure 4). For example,
with $J = 1$, $\Delta\theta = 0.1$ rad $\approx 6°$, and $X = Z = 1$, we have $\lambda = \{0.0000664436, 1.98064, 3.0193\}$.
For the smallest eigenvalue, $\lambda = 0.0000664436$, we have a corresponding $\alpha = 0.0666676$ and
$\beta = 10.0001$.
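
The smallest eigenvalue can also be found without a symbolic package. The sketch below is one possible approach, not the procedure used in the report: it evaluates the cubic (36) and brackets its smallest root by bisection on $[0, S]$, which relies on the assumption that the cubic is non-negative at $\lambda = 0$ and negative at $\lambda = S$ (true when $C > S$, i.e., for modest rotation angles). It also evaluates the small-angle estimate (38) for comparison. All function names are hypothetical.

```python
import math

def reduced_sums(J, dtheta):
    """C, S, E, C', S' for equi-angular frames j = -J..J."""
    js = range(-J, J + 1)
    C = sum(math.cos(j * dtheta) ** 2 for j in js)
    S = sum(math.sin(j * dtheta) ** 2 for j in js)
    E = sum(j * math.cos(j * dtheta) * math.sin(j * dtheta) for j in js)
    Cp = sum(j * j * math.cos(j * dtheta) ** 2 for j in js)
    Sp = sum(j * j * math.sin(j * dtheta) ** 2 for j in js)
    return C, S, E, Cp, Sp

def cubic(lam, C, S, E, Cp, Sp, X, Z):
    """Left-hand side of the cubic (36)."""
    return ((S - lam) * (Sp * (C - lam) - E * E) * X
            + (C - lam) * (Cp * (S - lam) - E * E) * Z
            - (S - lam) * (C - lam) * lam)

def smallest_eigenvalue(J, dtheta, X, Z, iterations=200):
    """Bisection for the root of (36) in [0, S]; returns (lambda, alpha, beta)."""
    C, S, E, Cp, Sp = reduced_sums(J, dtheta)
    lo, hi = 0.0, S
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if cubic(mid, C, S, E, Cp, Sp, X, Z) > 0.0:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return lam, E / (C - lam), E / (S - lam)

def small_angle_estimate(J, dtheta, X, Z):
    """Closed-form approximation (38)."""
    js = range(-J, J + 1)
    J0 = len(js)
    J2 = sum(j ** 2 for j in js)
    J4 = sum(j ** 4 for j in js)
    K = J0 * J4 - J2 * J2
    return dtheta ** 4 * X * J2 * K / (J0 * J2 * Z + dtheta ** 2 * (X * K + J0 * J2))

lam, alpha, beta = smallest_eigenvalue(1, 0.1, 1.0, 1.0)
lam_approx = small_angle_estimate(1, 0.1, 1.0, 1.0)
```

For $J = 1$, $\Delta\theta = 0.1$, and $X = Z = 1$ this gives a smallest eigenvalue and an $\alpha$ that agree with the values quoted above, a $\beta$ very close to 10, and a closed-form estimate within about 1% of the exact root. With $X = Z = 6$ it also appears to match the $F = 3$ entries of Table 1.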

Once the smallest eigenvalue and eigenvector have been computed, we can easily determine
some additional eigenvectors. Any vector which consists purely of $x_i$ or $z_i$ values which is also
orthogonal to $A_{pm}$ is an eigenvector, e.g.,

$$
\begin{bmatrix} x_1 & 0 & -x_0 & 0 & \cdots & 0 & 0 \end{bmatrix}^T.
$$

Figure 4: Plot of $\log_{10} \lambda_{min}$ as a function of $J \in [1, 8]$ and $\Delta\theta \in [0.1, 1.5]$.



λ_min            F = 2     F = 3     F = 4     F = 5     F = 6     F = 7     F = 8
θ_tot = 11.5°    0.000000  0.000067  0.000079  0.000088  0.000096  0.000104  0.000112
θ_tot = 22.9°    0.000000  0.001087  0.001283  0.001418  0.001547  0.001677  0.001810
θ_tot = 34.4°    0.000000  0.005618  0.006597  0.007277  0.007931  0.008594  0.009269
θ_tot = 45°      0.000000  0.016854  0.019688  0.021673  0.023596  0.025552  0.027547
θ_tot = 60°      0.000000  0.054679  0.063442  0.069678  0.075782  0.082017  0.088389
θ_tot = 90°      0.000000  0.272977  0.316453  0.348500  0.380039  0.412200  0.444997

Table 1: Minimum eigenvalues for 1-D orthographic known equi-angular motion



The eigenvalues corresponding to the pure $x$ eigenvectors are $C$, while the $z$ eigenvalues are $S$.
In other words, once the global bas-relief uncertainty has been accounted for (squashing in $z$ and
smaller bulging in $x$), the variance in $x$ position estimates is proportional to $C^{-1}$ and in $z$ positions
is proportional to $S^{-1}$, i.e., exactly the expected triangulation error for known camera positions.

For the above example with $J = 1$ (3 frames), $\Delta\theta = 0.1$ rad $\approx 6°$, and $X = Z = 1$, the
values for $C$ and $S$ are 2.98 and 0.0199, respectively. From this, we see that the correlated depth
uncertainty due to the motion uncertainty is a factor of $0.0199 / 0.00006644 \approx 300$ times greater than
the individual depth uncertainties. A full table of $\lambda_{min}$ as a function of $F = 2J + 1$ (the number of
frames) and $\theta_{tot} = (F - 1) \Delta\theta$ (the total rotation angle) is shown in Table 1.

λ_min            F = 2     F = 3     F = 4     F = 5     F = 6     F = 7     F = 8
θ_tot = 11.5°    0.000000  0.000067  0.000079  0.000087  0.000095  0.000103  0.000111
θ_tot = 22.9°    0.000000  0.001080  0.001263  0.001391  0.001513  0.001636  0.001762
θ_tot = 34.4°    0.000000  0.005537  0.006377  0.006971  0.007549  0.008136  0.008731
θ_tot = 45°      0.000000  0.016450  0.018596  0.020163  0.021721  0.023311  0.024924
θ_tot = 60°      0.000000  0.052521  0.057558  0.061612  0.065825  0.070179  0.074598
θ_tot = 90°      0.000000  0.254859  0.261589  0.273769  0.288362  0.303857  0.319541

Table 2: Minimum eigenvalues for 1-D orthographic equi-angular motion with no constraint

5.3 More than two frames, without motion constraint 

If we take the same data set as above, but remove the additional knowledge of equi-angular steps, 
we end up solving for each motion (angle) estimate separately. The equations for A Pt , A Pjm ., and 
A mj are given in (22-23), with D = 0. Let us guess that the bas-relief ambiguity eigenvector has 
the form 



e 0 



(XXq 



-(3z 0 



-(3z N 



-J 



J 



The requirements for this to be an eigenvector are similar to those we derived before, 

aC — E = aX 
fJS-E = fJX 

cVjZ - aW) + c jSj (2jW -aX - (3Z) + s 2 J (jX - (3W) = Xj. 



(39) 

(40) 
(41) 
(42) 



In this case, we do not have a closed form solution, since we have $2J + 3$ equations in 3 unknowns.
However, if we assume a small angle approximation and $W = 0$ (i.e., that the 3-D point cloud is
rotationally symmetric with respect to the middle frame), then the $2J + 1$ equations of the form (42)
are equivalent and we get the same eigenvectors as with the known equiangular motion constraint.

This behavior can be verified numerically (Table 2), where the results are quite similar to those
shown in Table 1. To obtain these results, we computed the $A$ matrix explicitly using a set of 9
points sampled on the unit square, i.e., $\{(x, z) : x, z \in \{-1, 0, 1\}\}$, and then computed the
eigenvalues. Note, however, that for an example where $W \neq 0$, i.e., by adding one additional point at
(2, 2) to the previous example, we get an eigenvector which is not of the form hypothesized in (39).
It is, however, an affine transform of the $(x_i, z_i)$ coordinates.
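
The kind of numerical check described above can be sketched in a few lines. This is a minimal illustration rather than the report's own code: it assumes the projection $u_{ij} = c_j x_i - s_j z_i$ with unit measurement noise, ignores translations, holds the middle frame's angle fixed to remove the coordinate-frame freedom, and uses NumPy. The function names are hypothetical.

```python
import numpy as np

def hessian_1d_orthographic(points, angles, ref):
    """Information matrix A = sum_ij H_ij^T H_ij for u_ij = c_j x_i - s_j z_i.

    Unknowns are (x_i, z_i) for every point plus one angle per frame,
    except frame `ref`, whose angle is held fixed (assumed gauge choice).
    """
    n = len(points)
    free = [j for j in range(len(angles)) if j != ref]
    rows = []
    for j, theta in enumerate(angles):
        c, s = np.cos(theta), np.sin(theta)
        for i, (x, z) in enumerate(points):
            h = np.zeros(2 * n + len(free))
            h[2 * i] = c                # du_ij / dx_i
            h[2 * i + 1] = -s           # du_ij / dz_i
            if j != ref:
                h[2 * n + free.index(j)] = -s * x - c * z  # du_ij / dtheta_j
            rows.append(h)
    H = np.array(rows)
    return H.T @ H

def min_eigenvalue(total_angle_deg, frames):
    """Smallest eigenvalue for 9 points on the unit square, equal angular steps."""
    pts = [(x, z) for x in (-1, 0, 1) for z in (-1, 0, 1)]
    half = total_angle_deg / 2.0
    angles = np.radians(np.linspace(-half, half, frames))
    A = hessian_1d_orthographic(pts, angles, ref=frames // 2)
    return np.linalg.eigvalsh(A)[0]   # eigvalsh returns ascending eigenvalues

for frames in (2, 3, 4):
    print(frames, min_eigenvalue(11.5, frames))
```

Under these assumptions the two-frame case has 18 measurements for 19 unknowns, so its smallest eigenvalue is zero up to round-off, while three or more frames give a small positive value that grows with the total viewing angle.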



$S_{all}$               F = 2     F = 3     F = 4     F = 5     F = 6     F = 7     F = 8
$\theta_{tot}$ = 11.5°  $\infty$  123.61    113.80    108.18    103.50    99.34     95.60
$\theta_{tot}$ = 22.9°  $\infty$  31.81     29.46     28.02     26.80     25.71     24.73
$\theta_{tot}$ = 34.4°  $\infty$  14.88     13.88     13.21     12.62     12.09     11.62
$\theta_{tot}$ = 45°    $\infty$  9.32      8.74      8.30      7.92      7.58      7.27
$\theta_{tot}$ = 60°    $\infty$  6.01      5.65      5.35      5.08      4.85      4.64
$\theta_{tot}$ = 90°    $\infty$  3.94      3.62      3.37      3.16      2.99      2.84

Table 3: $S_{all}$ estimates for 1-D orthographic equi-angular motion with no constraint, X = Z = 100,
$\sigma = 1$.

We can also estimate the expected reconstruction error $S_{all}$ by summing the inverse eigenvalues.
Using the same parameters as for Table 2, but with X = Z = 100 to make structure errors
dominate, we obtain the results in Table 3. This table shows how the bas-relief ambiguity dominates
the reconstruction error. At small viewing angles, doubling the angle results in a fourfold reduction
in the sum of squared error $S_{all}$. Adding more frames is much less effective than increasing the
effective baseline of the system.
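
The fourfold claim can be checked directly against the tabulated numbers. The short sketch below copies a few $S_{all}$ values from Table 3 and compares the effect of doubling the angle with the effect of adding frames; the variable and function names are illustrative only.

```python
# S_all values copied from Table 3, keyed by total viewing angle in degrees.
s_all_f3 = {11.5: 123.61, 22.9: 31.81, 45.0: 9.32, 90.0: 3.94}   # F = 3 column
s_all_f8_small_angle = 95.60                                      # F = 8, 11.5 degrees

def reduction_factor(table, small_angle, large_angle):
    """Factor by which S_all shrinks when the angle grows from small to large."""
    return table[small_angle] / table[large_angle]

for small, large in [(11.5, 22.9), (22.9, 45.0), (45.0, 90.0)]:
    factor = reduction_factor(s_all_f3, small, large)
    print(f"{small} -> {large} degrees: S_all shrinks by {factor:.2f}x")

# Going from 3 to 8 frames at the smallest angle helps far less.
frames_factor = s_all_f3[11.5] / s_all_f8_small_angle
print(f"F = 3 -> F = 8 at 11.5 degrees: {frames_factor:.2f}x")
```

The reduction is close to fourfold only at the smallest angles and tapers off as the total angle grows, which matches the "at small viewing angles" qualifier in the text.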



6 Orthography: full 3-D reconstruction 

The situation with a regular orthographic camera (2-D retina, 3-D world) is quite similar to the 
scanline camera. In this case, we use unit quaternions to represent the rotation matrices, 

$$u_{ij} = r_{00j} x_i + r_{01j} y_i + r_{02j} z_i \qquad (43)$$
$$v_{ij} = r_{10j} x_i + r_{11j} y_i + r_{12j} z_i, \qquad (44)$$

where the entries in the rotation matrix $r_{kl}$ are given in (3).

To obtain a qualitative feel for the bas-relief ambiguity, let us examine the known equiangular 
motion case with a small amount of rotation around a fixed axis (say in the y-z plane), 

$$q_j \approx [1, (0, j q_1, j q_2)], \qquad (45)$$

where $q_1$ is the incremental rotation around the y axis, and $q_2$ is the rotation about the z (optical)
axis. As before, we ignore camera translations under orthography, since these can be recovered
from the motion of the point centroid.
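The Jacobian matrix is now

$$H_{ij} = \begin{bmatrix}
\frac{\partial u_{ij}}{\partial x_i} & \frac{\partial u_{ij}}{\partial y_i} & \frac{\partial u_{ij}}{\partial z_i} & \frac{\partial u_{ij}}{\partial q_1} & \frac{\partial u_{ij}}{\partial q_2} \\
\frac{\partial v_{ij}}{\partial x_i} & \frac{\partial v_{ij}}{\partial y_i} & \frac{\partial v_{ij}}{\partial z_i} & \frac{\partial v_{ij}}{\partial q_1} & \frac{\partial v_{ij}}{\partial q_2}
\end{bmatrix} \qquad (46)$$

$$\approx \begin{bmatrix}
1 & 2 j q_2 & -2 j q_1 & -4 j^2 q_1 x_i - 2 j z_i & -4 j^2 q_2 x_i + 2 j y_i \\
-2 j q_2 & 1 & 2 j^2 q_1 q_2 & 2 j^2 q_2 z_i & -2 j x_i - 4 j^2 q_2 y_i + 2 j^2 q_1 z_i
\end{bmatrix}. \qquad (47)$$

The entries in the information matrix are

$$A_{p_i} \approx \begin{bmatrix}
J_0 & 0 & 0 \\
0 & J_0 & -2 J_2 q_1 q_2 \\
0 & -2 J_2 q_1 q_2 & 4 J_2 q_1^2
\end{bmatrix} \qquad (48)$$

$$A_{p_i m} \approx \begin{bmatrix}
-4 J_2 q_1 x_i & -2 J_2 q_2 x_i \\
-2 J_2 q_2 z_i & 2 J_2 q_1 z_i \\
4 J_2 q_1 z_i & -4 J_2 q_1 y_i
\end{bmatrix} \qquad (49)$$

$$A_m \approx \begin{bmatrix}
4 J_2 \sum_i z_i^2 & -4 J_2 \sum_i y_i z_i \\
-4 J_2 \sum_i y_i z_i & 4 J_2 \sum_i (x_i^2 + y_i^2)
\end{bmatrix}
= 4 J_2 \begin{bmatrix} Z & -W \\ -W & X + Y \end{bmatrix} \qquad (50)$$

with $Y = \sum_i y_i^2$, $W = \sum_i y_i z_i$, and other terms as defined before.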



These equations are similar to those for the orthographic scanline camera (22-23), with $C \sim J_0$,
$S \sim J_2 q_1^2$, $E \sim J_2 q_1$, and $C'' \sim J_2$. In the absence of positional uncertainty, the uncertainties of
the $q_1$ and $q_2$ estimates ($A_m^{-1}$) are inversely proportional to Z and X + Y, respectively, as is to be
expected. Similarly, with known motion, the triangulation errors ($A_p^{-1}$) are inversely proportional
to the number of frames $J_0$ for x and y, and to the squared rotation angle $J_2 q_1^2$ for z.
Notice that a non-zero tilt of the rotation axis ($q_2 \neq 0$) confounds some of the y and z positional
uncertainties.

Instead of trying to find an analytical solution to the eigenvalue problem, we present a brief
example showing the dependence of $\lambda_{\min}$ on $q_1$ and $q_2$ (Table 4). For this example, we used a 15-point
data set consisting of the 8 corners of a unit cube, the 6 cube faces, and the origin. The eigenvalues
for the no-tilt case ($q_2 = 0$) are almost identical to the results of 1-D analysis (Table 2). The
eigenvalues for the tilted case ($q_2/q_1 = \tan 30°$) are similar in shape but show the effect of the overall
decrease in $q_1$ values. By examining the eigenvectors (not shown here), we observe that for both
cases, the minimum eigenvector has no y components.



$\lambda_{\min}$        F = 2     F = 3     F = 4     F = 5     F = 6     F = 7     F = 8
$\theta_{tot}$ = 11.5°  0.000000  0.000067  0.000079  0.000088  0.000096  0.000104  0.000112
$\theta_{tot}$ = 22.9°  0.000000  0.001092  0.001267  0.001410  0.001531  0.001665  0.001792
$\theta_{tot}$ = 34.4°  0.000000  0.005677  0.006405  0.007189  0.007747  0.008452  0.009065
$\theta_{tot}$ = 45°    0.000000  0.017153  0.018653  0.021226  0.022638  0.024838  0.026500
$\theta_{tot}$ = 60°    0.000000  0.056333  0.056757  0.067148  0.069948  0.078044  0.082245
$\theta_{tot}$ = 90°    0.000000  0.287619  0.203405  0.320241  0.268727  0.343410  0.318149

Table 4: Minimum eigenvalues for 2-D orthographic equi-angular motion with no constraint,
rotation around y axis ($q_1 = \sin\frac{\theta}{2}$, $q_2 = 0$).

$\lambda_{\min}$        F = 2     F = 3     F = 4     F = 5     F = 6     F = 7     F = 8
$\theta_{tot}$ = 11.5°  0.000000  0.000046  0.000055  0.000061  0.000066  0.000072  0.000077
$\theta_{tot}$ = 22.9°  0.000000  0.000750  0.000873  0.000971  0.001056  0.001148  0.001236
$\theta_{tot}$ = 34.4°  0.000000  0.003857  0.004392  0.004919  0.005316  0.005795  0.006224
$\theta_{tot}$ = 45°    0.000000  0.011507  0.012731  0.014410  0.015451  0.016919  0.018101
$\theta_{tot}$ = 60°    0.000000  0.036927  0.038640  0.044940  0.047420  0.052530  0.055737
$\theta_{tot}$ = 90°    0.000000  0.170400  0.150632  0.200555  0.196403  0.233575  0.235277

Table 5: Minimum eigenvalues for 2-D orthographic equi-angular motion with no constraint,
rotation around y axis tilted 30° ($q_1 = \cos 30° \sin\frac{\theta}{2}$, $q_2 = \sin 30° \sin\frac{\theta}{2}$).

7 Perspective: single scanline 

Before analyzing the perspective camera in 3-D, let us briefly look at a perspective scanline (1-D)
camera. We can use this model to develop some intuitions, but unfortunately we cannot use it to
predict the performance of the full two-frame algorithm, since even under perspective, the scanline
camera has a bas-relief ambiguity. This can be shown by a simple parameter counting argument:
there are 2N unknowns for the 2-D coordinates $\{(x_i, z_i)\}$ and 1 (or more) unknowns for the motion,
but only 2N measurements. In other words, we can place the cameras arbitrarily, and the
intersections of the optical rays will determine the location of the 2-D points. This argument obviously does
not carry over to 3-D, but it is suggestive of why two-frame structure from motion may be poorly
conditioned.

$\lambda_{\min}$        F = 2     F = 3     F = 4     F = 5     F = 6     F = 7     F = 8
$\theta_{tot}$ = 11.5°  0.000000  0.000080  0.000094  0.000104  0.000114  0.000124  0.000133
$\theta_{tot}$ = 22.9°  0.000000  0.001274  0.001498  0.001655  0.001807  0.001960  0.002116
$\theta_{tot}$ = 34.4°  0.000000  0.006421  0.007489  0.008257  0.009006  0.009768  0.010544
$\theta_{tot}$ = 45°    0.000000  0.018670  0.021580  0.023745  0.025885  0.028072  0.030305
$\theta_{tot}$ = 60°    0.000000  0.057351  0.065494  0.071906  0.078373  0.085026  0.091834
$\theta_{tot}$ = 90°    0.000000  0.255136  0.288877  0.317718  0.347360  0.377933  0.409211

Table 6: Minimum eigenvalues for 1-D perspective projection, equi-angular rotation, $\eta = 0.2$.
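
The parameter-counting argument for the two-frame scanline camera amounts to simple arithmetic, sketched below. The function name and the `motion_unknowns` parameter are illustrative; the count of one image coordinate per point per frame is the assumption taken from the text.

```python
def unknowns_minus_measurements(num_points, num_frames, motion_unknowns=1):
    """Positive result: more unknowns than measurements (under-determined)."""
    unknowns = 2 * num_points + motion_unknowns   # (x_i, z_i) per point, plus motion
    measurements = num_points * num_frames        # one image coordinate per point per frame
    return unknowns - measurements

# With two frames the deficit never closes, however many points are tracked.
for n in (5, 9, 100):
    print(n, unknowns_minus_measurements(n, num_frames=2))
```

Because the two-frame deficit equals the number of motion unknowns regardless of N, adding points cannot resolve the ambiguity; only additional frames (or the extra image dimension of a full 2-D retina) change the balance.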

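The projection equation for a scanline camera, using the new projection model introduced in
(5), is

$$u_{ij} = \frac{c_j x_i - s_j z_i + t_{xj}}{1 + \eta (s_j x_i + c_j z_i + t_{zj})}. \qquad (51)$$

The Jacobian matrix is

$$H_{ij} = \begin{bmatrix}
\frac{\partial u_{ij}}{\partial x_i} & \frac{\partial u_{ij}}{\partial z_i} & \frac{\partial u_{ij}}{\partial \theta_j} & \frac{\partial u_{ij}}{\partial t_{xj}} & \frac{\partial u_{ij}}{\partial t_{zj}}
\end{bmatrix}$$
$$= \frac{1}{D_{ij}} \begin{bmatrix}
c_j - \eta s_j \hat{u}_{ij} & -(s_j + \eta c_j \hat{u}_{ij}) & -(s_j x_i + c_j z_i + \eta (c_j x_i - s_j z_i) \hat{u}_{ij}) & 1 & -\eta \hat{u}_{ij}
\end{bmatrix}, \qquad (52)$$

where $\hat{u}_{ij}$ is the predicted value of $u_{ij}$ computed by (51) and $D_{ij}$ is the denominator of (51). In addition
to the usual coordinate frame ambiguity, we have a scale ambiguity, i.e., the $(x_i, z_i)$ and $t_{xj}$ can be
multiplied by a factor $a$, and $t_{zj}$ can be set to $a t_{zj} + (a - 1)/\eta$ without affecting the solution.
As mentioned above, a full bas-relief ambiguity also exists for 2 frames.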
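
The scale ambiguity can be confirmed numerically. Requiring the denominator of (51) to scale by the same factor as the numerator gives $t_z \to a t_z + (a - 1)/\eta$; the sketch below applies that change and checks that the image coordinate is unchanged. It assumes the projection form $u = (c x - s z + t_x) / (1 + \eta (s x + c z + t_z))$, and the function names and test values are illustrative.

```python
import math

def project(x, z, theta, tx, tz, eta):
    """Scanline perspective projection in the form of (51)."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * x - s * z + tx) / (1.0 + eta * (s * x + c * z + tz))

def rescale(x, z, tx, tz, eta, a):
    """Scale structure and t_x by a, and map t_z to a*t_z + (a - 1)/eta."""
    return a * x, a * z, a * tx, a * tz + (a - 1.0) / eta

# Arbitrary illustrative values.
x, z, theta, tx, tz, eta, a = 0.7, -0.4, 0.3, 0.1, 0.2, 0.2, 1.8
u_before = project(x, z, theta, tx, tz, eta)
xs, zs, txs, tzs = rescale(x, z, tx, tz, eta, a)
u_after = project(xs, zs, theta, txs, tzs, eta)
print(u_before, u_after)
```

Both numerator and denominator pick up the same factor $a$, so the ratio, and hence every measurement, is unaffected.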

Rather than continuing our analysis with the construction of the Hessian matrix A, let us just
look briefly at the form of $H_{ij}$. In addition to the terms already present under orthography (20), we
have the extra terms involving $\eta$, as well as the partial with respect to $t_{zj}$. These additional terms
are what will, in full 3-D, enable the two-frame problem to be solved by removing the bas-relief
ambiguity.

To see the effects of using a perspective camera instead of an orthographic camera, we show
in Table 6 the minimum eigenvalue as a function of total viewing angle and number of frames.
Compared to Table 2, we see that there is a small, but not dramatic, improvement in the size of
the minimum eigenvalue.



$\lambda_{\min}$        F = 2     F = 3     F = 4     F = 5     F = 6     F = 7     F = 8
$\theta_{tot}$ = 11.5°  0.000175  0.000214  0.000239  0.000269  0.000299  0.000331  0.000364
$\theta_{tot}$ = 22.9°  0.000690  0.001289  0.001462  0.001633  0.001803  0.001981  0.002158
$\theta_{tot}$ = 34.4°  0.001512  0.004372  0.004972  0.005491  0.006009  0.006510  0.007024
$\theta_{tot}$ = 45°    0.002512  0.009905  0.011282  0.012020  0.012959  0.013460  0.014070
$\theta_{tot}$ = 60°    0.004234  0.020246  0.022853  0.021650  0.021870  0.020495  0.019727
$\theta_{tot}$ = 90°    0.008381  0.032074  0.032623  0.027976  0.026149  0.023367  0.021596

Table 7: Minimum eigenvalues for 3-D perspective projection, equi-angular rotation around y axis,
$\eta = 0.1$.

8 Perspective in 3-D 

Let us finally analyze the most interesting case, that of a perspective camera operating in a 3-D 
environment. Here, we know that the two-frame problem has a solution, although our results on 
the simpler camera models suggest that the reconstructions may be particularly sensitive to noise. 

The forward imaging equations are given in (1-3) and (5). We will not bother deriving the
Jacobian and Hessian matrices here, as they are complex and not particularly informative. Instead,
we present some numerical results on $\lambda_{\min}$ and $RMS_{pos}$ and discuss their significance. (Note that
$RMS_{pos} = S_{pos}/\sqrt{n}$, where n is the number of points.) These results were obtained using the
Mathematica package [Wol91], by analytically differentiating the forward projection equations,
and then substituting in the known structure and motion parameters. Numerical eigenvalue analysis
was then used to obtain our results. For these examples, we used the 15 points sampled on the unit
cube described in Section 6.

We present results for two special cases: pure object-centered rotation (which in camera-centered
coordinates is actually both rotation and translation), and pure forward translation. Ignoring the
effects of motion across the retina, these two cases capture the basic motion cues available to structure
from motion.

8.1 Pure object-centered rotations 

$\eta$                  0.025        0.05      0.1       0.2       0.3       0.4       0.5
$\phi$                  [illegible]  6°        12°       28°       46°       67°       90°
$\theta_{tot}$ = 11.5°  0.000010     0.000041  0.000175  0.000899  0.002648  0.003899  0.002947
$\theta_{tot}$ = 22.9°  0.000040     0.000161  0.000690  0.003505  0.010216  0.015504  0.011702
$\theta_{tot}$ = 34.4°  0.000087     0.000354  0.001512  0.007578  0.021758  0.034461  0.025941
$\theta_{tot}$ = 45°    0.000145     0.000591  0.002512  0.012402  0.035035  0.057861  0.043377
$\theta_{tot}$ = 60°    0.000247     0.001002  0.004234  0.020494  0.056570  0.097234  0.072229
$\theta_{tot}$ = 90°    0.000492     0.001993  0.008381  0.039718  0.105540  0.144799  0.111384

Table 8: Minimum eigenvalues $\lambda_{\min}$ for 3-D perspective projection, equi-angular rotation around y axis,
two frames (F = 2), varying $\eta$. $\phi$ is the camera's field of view.

To compute the minimum eigenvalue results, we used the same approach as for the orthographic 3-D
camera (Section 6). The computed eigenvalues are shown in Table 7. Compared to the orthographic
case (Table 4), we see some striking differences. First, the two-frame problem is now soluble (up
to a scale ambiguity, of course). Second, for small viewing angles, there is marked improvement
even for multiple frames. Third, the results for large viewing angles with small $\eta$'s are significantly
inferior to the orthographic results. This appears to be caused by ambiguities in camera motion
along the optical axis ($t_z$), which are neglected in the orthographic case.

Table 7 only shows us the results for a particular value of $\eta$. The dependence of $\lambda_{\min}$ on $\eta$ is
presented in Tables 8 and 9 for the two and three frame problems. In these tables, the fields of view
equivalent to each $\eta$ were computed from the horizontal spread of the data points on the unit cube
and the distance of the cube from the camera $\eta^{-1}$ using the formula $\phi = 2 \tan^{-1} \frac{1}{\eta^{-1} - 1}$. As can be
seen for the two-frame case, doubling the amount of perspective distortion $\eta$ results in a fourfold
increase in $\lambda_{\min}$ (and hence a halving of the RMS error). For the three-frame case, the results are
less sensitive to $\eta$.
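
One reading of the field-of-view computation that reproduces the $\phi$ values in the table headers (after truncation to whole degrees) is sketched below. It assumes the camera sits at distance $\eta^{-1}$ from the cube center, that the cube's front face is one unit closer, and that the horizontal half-spread of the points is one unit; these geometric assumptions and the function name are illustrative.

```python
import math

def field_of_view_deg(eta, half_width=1.0, half_depth=1.0):
    """Angle subtended by the cube's front face, in degrees (assumed geometry)."""
    distance_to_front = 1.0 / eta - half_depth
    return math.degrees(2.0 * math.atan(half_width / distance_to_front))

for eta in (0.1, 0.2, 0.3, 0.4, 0.5):
    print(f"eta = {eta}: phi = {field_of_view_deg(eta):.2f} degrees")
```

The same geometry makes clear why $\eta$ acts as a measure of perspective distortion: as $\eta$ grows, the camera moves closer and the cube fills a wider angle.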

What does a typical minimum eigenvector look like? Figure 5 shows the eigenvector
corresponding to the three-frame problem with $\eta = 0.1$ and $\theta_{tot} = 11.5°$. As we can see, the majority of
the ambiguity is indeed a depth scaling. Notice, however, that the eigenvector is not a pure affine
transform of the 3-D coordinates, since the tips of the vectors in a given row do not form a straight
line (this has also been verified numerically). Our conjecture is that the minimum eigenvector may
be a projective transformation of the 3-D points, i.e., that the main ambiguity is projective, but we
have not yet found a proof for this conjecture.

$\eta$                  0.025        0.05      0.1       0.2       0.3       0.4       0.5
$\phi$                  [illegible]  6°        12°       28°       46°       67°       90°
$\theta_{tot}$ = 11.5°  0.000043     0.000075  0.000214  0.000956  0.002736  0.003908  0.002958
$\theta_{tot}$ = 22.9°  0.000502     0.000688  0.001289  0.004384  0.011565  0.015655  0.011874
$\theta_{tot}$ = 34.4°  0.001399     0.002606  0.004372  0.011820  0.028129  0.035277  0.026838
$\theta_{tot}$ = 45°    0.001825     0.005074  0.009905  0.023998  0.052204  0.060488  0.046154
$\theta_{tot}$ = 60°    0.002009     0.007177  0.020246  0.051574  0.103096  0.107273  0.082110
$\theta_{tot}$ = 90°    0.002098     0.008302  0.032074  0.121672  0.205362  0.215310  0.181425

Table 9: Minimum eigenvalues $\lambda_{\min}$ for 3-D perspective projection, equi-angular rotation around y axis,
three frames (F = 3), varying $\eta$. $\phi$ is the camera's field of view.

$RMS_{pos}$             F = 2     F = 3     F = 4     F = 5     F = 6     F = 7     F = 8
$\theta_{tot}$ = 11.5°  20.78     19.08     18.04     17.02     16.12     15.32     14.62
$\theta_{tot}$ = 22.9°  10.51     8.09      7.61      7.19      6.83      6.51      6.23
$\theta_{tot}$ = 34.4°  7.13      4.64      4.38      4.13      3.94      3.75      3.60
$\theta_{tot}$ = 45°    5.57      3.24      3.06      2.89      2.76      2.63      2.53
$\theta_{tot}$ = 60°    4.35      2.32      2.19      2.07      1.98      1.89      1.82
$\theta_{tot}$ = 90°    3.25      1.70      1.59      1.49      1.43      1.37      1.32

Table 10: $RMS_{pos}$ for 3-D perspective projection, equi-angular rotation around y axis, $\eta = 0.1$.

How do the 3-D (position) errors $RMS_{pos}$ vary with the number of frames and viewing angle?

By computing the full covariance matrix (inverting A) and taking the trace of the positional
covariance matrix $C_p$ (as described in Section 3.2), we obtain the results shown in Table 10. These
numbers indicate the relative errors in reconstruction for a unit retina and unit noise. For example,
if the retina is actually 200 pixels wide (s = 100 in (5)) and the positional error in the tracked points
is $\sigma = 0.1$, then the 3-D reconstruction errors would be 1000 times smaller than the values given
in Table 10. We see that this error decreases linearly with total viewing angle (for small viewing
angles), and varies only slightly with the total number of frames. This is similar to the results
obtained when computing $\lambda_{\min}$ in Table 4 (remember that RMS error should be proportional to the
square root of the inverse eigenvalues).

Figure 5: Minimum eigenvector for a three-frame perspective reconstruction problem: (a) top-down
view (x-z), (b) frontal view (x-y). While the main ambiguity is a z scaling, the vector is
not exactly an affine transform of the 3-D points on the unit cube.
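
The rescaling of the Table 10 figures to a concrete camera, and the inverse-square-root relation between eigenvalue and RMS error, can be written out as follows. This is a small sketch under the stated convention (errors scale linearly with the noise level and inversely with the retina half-width s); the function names are illustrative.

```python
import math

def scale_rms(rms_unit, sigma, s):
    """Convert a unit-retina, unit-noise RMS error to noise sigma (pixels)
    on a retina of half-width s (pixels)."""
    return rms_unit * sigma / s

def rms_ratio(lambda_old, lambda_new):
    """Expected RMS change factor if RMS is proportional to 1/sqrt(lambda)."""
    return math.sqrt(lambda_old / lambda_new)

# Table 10 entry for F = 2, total angle 11.5 degrees, on a 200-pixel retina
# with 0.1-pixel tracking noise.
print(scale_rms(20.78, sigma=0.1, s=100))

# A fourfold increase in the minimum eigenvalue halves the predicted RMS error.
print(rms_ratio(0.000010, 0.000040))
```

The rescaled error is expressed in the same units as the cube coordinates, so it should be compared against the cube's side length of 2.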

8.2 Looming 

The motion of a camera forward in a 3-D world creates a different kind of parallax, which can also
be exploited to compute structure from motion. To compute the ambiguities in this kind of motion,
we used the same approach as before, except with no rotation and pure forward motion ($t_z \neq 0$).

Using our usual 15-point data set results in some unexpected behavior: four of the eigenvalues 
are zero. This is because the z coordinates of the three points on the optical axis cannot be recov- 
ered as they lie on the focus of expansion. This is a severe limitation of recovering structure from 
looming: points near the focus of expansion are recovered with extremely poor accuracy. For the 
experiments in this section, we use a 12-point data set instead, i.e., the 15-point set with the three 
points (x, y) = (0, 0) removed. 
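
Why points on the optical axis carry no depth information under looming can be seen from the projection itself. The sketch below assumes the 3-D analogue of (51) with no rotation and no sideways translation, $u = x / (1 + \eta (z + t_z))$ and $v = y / (1 + \eta (z + t_z))$ on a unit retina; this form and the function names are assumptions made for illustration.

```python
def project_forward(x, y, z, tz, eta):
    """Perspective projection under pure forward translation (assumed model)."""
    d = 1.0 + eta * (z + tz)
    return x / d, y / d

def depth_sensitivity(x, y, z, tz, eta, h=1e-6):
    """Central-difference estimate of (du/dz, dv/dz)."""
    u1, v1 = project_forward(x, y, z + h, tz, eta)
    u0, v0 = project_forward(x, y, z - h, tz, eta)
    return (u1 - u0) / (2 * h), (v1 - v0) / (2 * h)

# A point on the optical axis versus a point off the axis.
print(depth_sensitivity(0.0, 0.0, 1.0, tz=0.2, eta=0.3))
print(depth_sensitivity(1.0, 0.0, 0.0, tz=0.2, eta=0.3))
```

For an on-axis point the image position is (0, 0) in every frame, so the column of the Jacobian for its z coordinate is identically zero and the corresponding eigenvalue vanishes. The sensitivity grows in proportion to the distance from the axis, which is why points near the focus of expansion are recovered so poorly.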

Table 11 shows $\lambda_{\min}$ as a function of the number of frames F and the total extent of forward
motion $t_z$ (the object being viewed is a unit cube with coordinates $[-1, 1]^3$). These results are for
a camera with $\eta = 0.3$, i.e., a camera placed about 3.3 units away from the cube origin. As we
can see, the two-frame results are almost as good as the three-frame results with the same extent of
motion. The value of $\lambda_{\min}$ appears to depend quadratically on the total extent of motion. Overall,
however, these results are much worse than those available with object-centered rotation.
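
The quadratic dependence can be checked by fitting a straight line to the two-frame values in log-log space; a slope near 2 indicates $\lambda_{\min} \propto t_z^2$. The sketch below uses the F = 2 column of Table 11 and an ordinary least-squares fit written out by hand; the function name is illustrative.

```python
import math

def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den

# Values copied from Table 11 (F = 2 column).
t_z = [0.1, 0.2, 0.3, 0.4, 0.5]
lam_f2 = [0.000007, 0.000027, 0.000060, 0.000107, 0.000168]

print(f"fitted exponent: {loglog_slope(t_z, lam_f2):.2f}")
```

The tabulated values are rounded to six decimal places, so the fitted exponent is only approximate, particularly at the smallest $t_z$.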

$\lambda_{\min}$        F = 2     F = 3     F = 4     F = 5     F = 6
$t_z$ = 0.1             0.000007  0.000007  0.000007  0.000008  0.000009
$t_z$ = 0.2             0.000027  0.000027  0.000030  0.000033  0.000037
$t_z$ = 0.3             0.000060  0.000060  0.000067  0.000075  0.000084
$t_z$ = 0.4             0.000107  0.000107  0.000119  0.000134  0.000150
$t_z$ = 0.5             0.000168  0.000168  0.000187  0.000210  0.000235

Table 11: Minimum eigenvalues for 3-D perspective projection, pure forward translation, $\eta = 0.3$.

$\eta$                  0.1       0.2       0.3       0.4       0.5
$\phi$                  12°       28°       46°       67°       90°
$t_z$ = 0.1             0.000000  0.000002  0.000007  0.000013  0.000020
$t_z$ = 0.2             0.000001  0.000009  0.000027  0.000051  0.000078
$t_z$ = 0.3             0.000002  0.000020  0.000060  0.000115  0.000176
$t_z$ = 0.4             0.000004  0.000036  0.000107  0.000205  0.000314
$t_z$ = 0.5             0.000006  0.000057  0.000168  0.000320  0.000490

Table 12: Minimum eigenvalues $\lambda_{\min}$ for 3-D perspective projection, pure forward translation, F = 2,
varying $\eta$.

Table 12 shows $\lambda_{\min}$ as a function of $\eta$, i.e., the distance of the central frame to the object. It
appears that $\lambda_{\min}$ depends cubically on $\eta$, at least for small values of $t_z$. To obtain reasonable estimates,
therefore, it is necessary to use both a wide field of view and a large amount of motion relative to
the scene depth.

Figure 6 shows the structural part of the minimum eigenvector for $\eta = 0.3$, $J = 1$
(F = 3), and $\Delta t_z = 0.2$. By inspection of
the complete eigenvector (not shown here), we can see that the ambiguity is between the amount
of x and y yaw and x and y translation, i.e., it is a classic bas-relief ambiguity.

9 Experimental results 

To verify whether the positional errors predicted by our analysis coincide with the errors observed in
practice, we ran our iterative non-linear least squares algorithm on a 24-point sample data set [SK94].
The 24 points were four points at (±0.4685, ±0.4685) on the six faces of a unit $[-1, +1]^3$ cube. The
points were projected onto a 200 pixel wide retina (s = 100 in (5)) and 2-D noise with $\sigma = 0.1$ was
added to each projected point.⁵ The algorithm was then initialized with the correct 3-D structure
and run to completion.

Figure 6: Minimum eigenvector for a three-frame perspective reconstruction problem with pure
z translation: (a) top-down view (x-z), (b) frontal view (x-y). The main ambiguity is a rocking
confusion between sideways camera translation and rotation, which affects the points furthest back.

The 3-D positional errors are shown in Tables 13 and 14. Three kinds of error are shown: the
Euclidean error, after registering the recovered and true 3-D data sets under the best possible
similarity transform (rigid + scaling); the affine error (computing the best affine transform); and the
projective error (computing the best 4 × 4 homography). These errors were scaled by a factor of
1000 to make them "dimensionless" (i.e., unit retina, unit image noise). The RMS error predicted
by our uncertainty analysis (the trace of the positional covariance matrix) is also shown.

$RMS_{pos}$, F = 2      $\eta$ = 0.1                                 $\eta$ = 0.2
                        predicted  Euclidean  affine     projective  predicted  Euclidean  affine     projective
$\theta_{tot}$ = 8°     35.02      58.98      20.41      19.02       19.68      34.43      21.68      20.48
$\theta_{tot}$ = 16°    18.21      35.70      10.27      9.39        9.93       16.63      10.39      9.75
$\theta_{tot}$ = 32°    9.28       15.70      5.10       4.78        5.13       9.15       5.34       4.98
$\theta_{tot}$ = 60°    5.24       8.47       2.89       2.72        3.02       4.69       3.01       2.82
$\theta_{tot}$ = 90°    3.85       5.36       2.03       1.93        2.37       3.32       2.15       2.04

Table 13: RMS errors (predicted and observed) for 3-D perspective projection, equi-angular
rotation around y axis, two frames, 24 point data set.

$RMS_{pos}$, F = 3      $\eta$ = 0.1                                 $\eta$ = 0.2
                        predicted  Euclidean  affine     projective  predicted  Euclidean  affine     projective
$\theta_{tot}$ = 6°     41.94      61.17      20.21      18.76       25.79      40.45      22.21      20.26
$\theta_{tot}$ = 12°    19.83      26.90      10.31      9.69        12.55      18.12      10.39      9.71
$\theta_{tot}$ = 24°    7.42       11.34      4.99       4.76        5.75       8.08       5.23       4.91
$\theta_{tot}$ = 48°    2.76       3.70       2.50       2.43        2.59       3.63       2.72       2.61
$\theta_{tot}$ = 90°    1.59       1.96       1.54       1.50        1.57       1.90       1.59       1.53

Table 14: RMS errors (predicted and observed) for 3-D perspective projection, equi-angular
rotation around y axis, three frames, 24 point data set.

From these results, we can see that the uncertainty analysis predicts the general variation of
reconstruction error with viewing angle, perspective distortion, and number of frames. Unfortunately,
there remains a small but fairly consistent discrepancy between our predicted figures and the
measured errors, which we have not been able to track down. We also see that the affine error is about
2 to 3 times lower than the Euclidean error (actually, this factor increases with decreasing viewing
angle, as predicted by our analysis). The projective error is not significantly lower than the affine
error, which further supports our hypothesis that most of the error is in the bas-relief ambiguity.⁶

⁵ The results scale linearly with $\sigma$ up to about $\sigma = 1$, after which they increase sub-linearly (i.e., they less than
double when $\sigma$ is doubled).

⁶ It is not surprising that the projective error is always smaller than the affine error, as there are 3 more degrees of
freedom (15 vs. 12) in the projective fit used before the error computation.

10 Discussion

The results presented in this technical report suggest that in many situations where structure from
motion might be applied, the solutions are extremely sensitive to noise. In fact, despite dozens
of algorithms having been developed, very few results of convincing quality are available. Those
cases where metrically accurate results have been demonstrated almost always use a large amount
of rotation [TK92b].

This raises the obvious question: are any of the many structure from motion algorithms
developed in the computer vision community of practical significance? Or, when we wish to perform
metrically accurate reconstructions from images, should we adopt the photogrammetrists' approach
of using control points at known locations? This essentially reduces structure from motion to
camera pose estimation (and possibly calibration) followed by stereo reconstruction.

The situation is perhaps not that bad. For large object rotations, we can indeed recover accurate
reconstructions. Furthermore, for scene reconstruction, using cameras with large fields of view,
several cameras mounted in different directions, or even panoramic images, should remove most of
the ambiguities. In any case, it would appear prudent to carefully analyze the expected ambiguities
and uncertainties in any structure from motion problem (or any other image-based estimation task)
before actually putting a method into practice.

The general approach developed in this technical report, i.e., eigenvalue analysis of the Hessian
(information) matrix, appears to explain most of the known ambiguities in structure from motion.
However, there are certain ambiguities (e.g., depth reversals under orthography, or multiplicities of
solutions with few points and frames) which will not be detected by this analysis because they
correspond to multiple local minima of the cost function in the parameter space. Furthermore, analysis
of the information matrix can only predict the sensitivity of the results to small amounts of image
noise. Further study using empirical methods is required to determine the limitations of our
approach.

Using the minimum eigenvalue to predict the overall reconstruction error may fail when the
dominant ambiguities are in the motion parameters (e.g., what appears to be happening under
perspective for large motions). Computing the $RMS_{pos}$ error directly from the covariance matrix $A^{-1}$
is more useful in these cases.

10.1 Future work 

In future work, we plan to compare results available with object-centered and camera-centered
representations (Equations 4-5). Our guess is that the former will produce estimates of better quality.
Similarly, we would like to analyze the effects of mis-estimating internal calibration parameters
such as focal length, and to study the feasibility of estimating them as part of the reconstruction
process. The results presented here have assumed for now that feature points are visible in all
images. Our approach generalizes naturally to missing data points. In particular, we would like to
study the effects of feature tracks with relatively short lifetimes.

Finally, it appears that the portion of the uncertainty matrix which is correlated can be accounted
for by a small number of modes. This suggests that an efficient recursive structure from motion
algorithm could be developed which avoids the need for using full covariance matrices [THO93]
but which performs significantly better than algorithms which ignore such correlations.

11 Conclusions 

This technical report has developed new techniques for analyzing the fundamental ambiguities and 
uncertainties inherent in structure from motion. Our approach is based on examining the eigenval- 
ues and eigenvectors of the Hessian matrix in order to quantify the nature of these ambiguities. The 
eigenvalues can also be used to predict the overall accuracy of the reconstruction. 

Under orthography, the bas-relief ambiguity dominates the reconstruction error, even with large
numbers of frames. This ambiguity disappears, however, for large object-centered rotations. For
perspective cameras, two-frame solutions are possible, but there must still be a large amount of
object rotation for best performance. Using three or more frames avoids some of the sensitivities
associated with two-frame reconstructions. Translations towards the object are an alternative source of
shape information, but these appear to be quite weak unless large fields of view and large motions
are involved.

When available, prior information about the structure or motion (e.g., absolute distances, per- 
pendicularities) can be used to improve the accuracy of the reconstructions. Whether 3-D recon- 
struction errors (for modeling) or motion estimation errors (for navigation) are most significant for 
a given application determines the conditions which produce acceptable results. In any case, care- 
ful error analysis is essential in ensuring that the results of structure from motion algorithms are 
sufficiently reliable to be used in practice. 

References 

[Adi89] G. Adiv. Inherent ambiguities in recovering 3-D motion and structure from a
noisy flow field. IEEE Transactions on Pattern Analysis and Machine Intelligence,
11(5):477-490, May 1989.

[AHP93] A. Azarbayejani, B. Horowitz, and A. Pentland. Recursive estimation of structure and 
motion using relative orientation constraints. In IEEE Computer Society Conference 
on Computer Vision and Pattern Recognition (CVPR'93), pages 294-299, New York, 
New York, June 1993. 

[Aya91] N. Ayache. Artificial Vision for Mobile Robots: Stereo Vision and Multisensory Per- 
ception. MIT Press, Cambridge, Massachusetts, 1991. 

[CT92] Q. Chen and S. Tsuji. A hierarchical method that solves the shape and motion from 
an image sequence problem. In IEEE/RSJ Int'l Conference on Intelligent Robots and 
Systems, pages 2131-2138, July 1992. 

[CWC90] N. Cui, J. Weng, and P. Cohen. Extended structure and motion analysis from 
monocular image sequences. In Third International Conference on Computer Vision 
(ICCV'90), pages 222-229, Osaka, Japan, December 1990. IEEE Computer Society 
Press. 

[DA90] C. H. Debrunner and N. Ahuja. A direct data approximation based motion estimation 
algorithm. In 10th Int'l Conference on Pattern Recognition, pages 384-389, 1990. 

[Fau92] O. D. Faugeras. What can be seen in three dimensions with an uncalibrated stereo
rig? In Second European Conference on Computer Vision (ECCV'92), pages 563-578,
Santa Margherita Ligure, Italy, May 1992. Springer-Verlag.

[HG93] R. Hartley and R. Gupta. Computing matched-epipolar projections. In IEEE Com- 
puter Society Conference on Computer Vision and Pattern Recognition (CVPR'93), 
pages 549-555, New York, New York, June 1993. IEEE Computer Society. 

[HGC92] R. Hartley, R. Gupta, and T. Chang. Stereo from uncalibrated cameras. In IEEE Com- 
puter Society Conference on Computer Vision and Pattern Recognition (CVPR'92), 
pages 761-764, Champaign, Illinois, June 1992. IEEE Computer Society Press. 

[Hor90] B. K. P. Horn. Relative orientation. International Journal of Computer Vision, 
4(1):59-78, January 1990. 

[KTJ89] R. V. R. Kumar, A. Tirumalai, and R. C. Jain. A non-linear optimization algorithm for 
the estimation of structure and motion parameters. In IEEE Computer Society 
Conference on Computer Vision and Pattern Recognition (CVPR'89), pages 136-143, 
San Diego, California, June 1989. IEEE Computer Society Press. 

[KvD91] J. J. Koenderink and A. J. van Doorn. Affine structure from motion. Journal of the 
Optical Society of America A, 8:377-385, 1991. 

[LH81] H. C. Longuet-Higgins. A computer algorithm for reconstructing a scene from two 
projections. Nature, 293:133-135, 1981. 

[LH86] H. C. Longuet-Higgins. Visual motion ambiguity. Vision Research, 26(1):181-183, 
1986. 

[MQVB92] R. Mohr, L. Quan, F. Veillon, and B. Boufama. Relative 3D reconstruction using 
multiple uncalibrated images. Technical Report RT 84-IMAG-12, LIFIA-IRIMAG, 
Grenoble, France, June 1992. 

[MVQ93] R. Mohr, F. Veillon, and L. Quan. Relative 3D reconstruction using multiple 
uncalibrated images. In IEEE Computer Society Conference on Computer Vision and 
Pattern Recognition (CVPR'93), pages 543-548, New York, New York, June 1993. 

[OT91] J. Oliensis and J. I. Thomas. Incorporating motion error in multi-frame structure from 
motion. In IEEE Workshop on Visual Motion, pages 8-13, Princeton, New Jersey, 
October 1991. IEEE Computer Society Press. 

[PFTV92] W. H. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling. Numerical Recipes 
in C: The Art of Scientific Computing. Cambridge University Press, Cambridge, Eng- 
land, second edition, 1992. 

[SA89] M. E. Spetsakis and J. Y. Aloimonos. Optimal motion estimation. In IEEE Workshop 
on Visual Motion, pages 229-237, Irvine, California, March 1989. IEEE Computer 
Society Press. 

[SA91] M. E. Spetsakis and J. Y. Aloimonos. A multiframe approach to visual motion per- 
ception. International Journal of Computer Vision, 6(3):245-255, August 1991. 

[Sha93] A. Shashua. Projective depth: A geometric invariant for 3D reconstruction from two 
perspective/orthographic views and for visual recognition. In Fourth International 
Conference on Computer Vision (ICCV'93), pages 583-590, Berlin, Germany, May 
1993. IEEE Computer Society Press. 

[SK94] R. Szeliski and S. B. Kang. Recovering 3D shape and motion from image streams 
using nonlinear least squares. Journal of Visual Communication and Image 
Representation, 5(1):10-28, March 1994. 

[Sor80] H. W. Sorenson. Parameter Estimation, Principles and Problems. Marcel Dekker, 
New York, 1980. 

[SPFP93] S. Soatto, P. Perona, R. Frezza, and G. Picci. Recursive motion and structure esti- 
mation with complete error characterization. In IEEE Computer Society Conference 
on Computer Vision and Pattern Recognition (CVPR'93), pages 428-433, New York, 
New York, June 1993. 

[SZB93] L. S. Shapiro, A. Zisserman, and M. Brady. Motion from point matches using affine 
epipolar geometry. OUEL 1994/93, Oxford University Robotics Research Group, 
April 1993. 

[TH84] R. Y. Tsai and T. S. Huang. Uniqueness and estimation of three-dimensional motion 
parameters of rigid objects with curved surfaces. IEEE Transactions on Pattern 
Analysis and Machine Intelligence, PAMI-6(1):13-27, January 1984. 

[THO93] J. I. Thomas, A. Hanson, and J. Oliensis. Understanding noise: The critical role of 
motion error in scene reconstruction. In Fourth International Conference on Computer 
Vision (ICCV'93), pages 325-329, Berlin, Germany, May 1993. IEEE Computer 
Society Press. 

[TK92a] C. J. Taylor and D. J. Kriegman. Structure and motion from line segments in multiple 
images. In IEEE International Conference on Robotics and Automation, pages 1615- 
1621, Nice, France, May 1992. IEEE Computer Society Press. 

[TK92b] C. Tomasi and T. Kanade. Shape and motion from image streams under 
orthography: A factorization method. International Journal of Computer Vision, 
9(2):137-154, November 1992. 

[TKA91] C. J. Taylor, D. J. Kriegman, and P. Anandan. Structure and motion in two dimensions 
from multiple images: A least squares approach. In IEEE Workshop on Visual Motion, 
pages 242-248, Princeton, New Jersey, October 1991. IEEE Computer Society Press. 

[WAH89] J. Weng, N. Ahuja, and T. S. Huang. Optimal motion and structure estimation. 
In IEEE Computer Society Conference on Computer Vision and Pattern Recognition 
(CVPR'89), pages 144-152, San Diego, California, June 1989. IEEE Computer 
Society Press. 

[WAH93] J. Weng, N. Ahuja, and T. S. Huang. Optimal motion and structure estimation. IEEE 
Transactions on Pattern Analysis and Machine Intelligence, 15(9):864-884, Septem- 
ber 1993. 

[Wol91] S. Wolfram. Mathematica™, A System for Doing Mathematics by Computer. 
Addison-Wesley, 1991. 

[YC92] G.-S. Y. Young and R. Chellappa. Statistical analysis of inherent ambiguities in 
recovering 3-D motion from a noisy flow field. IEEE Transactions on Pattern Analysis 
and Machine Intelligence, 14(10):995-1013, October 1992. 



A Approximate minimum eigenvalue computation 



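The eigenvalues of a matrix of the form 

\[
\begin{bmatrix} a^2 & ab \\ ab & b^2 + c^2 \end{bmatrix}
\]

are the solutions to 

\[
\lambda (a^2 + b^2 + c^2 - \lambda) - a^2 c^2 = 0,
\]

i.e., 

\[
\lambda = \frac{(a^2 + b^2 + c^2) \pm \sqrt{(a^2 + b^2 + c^2)^2 - 4 a^2 c^2}}{2},
\]

or for $c^2 \ll a^2 + b^2$, 

\[
\lambda_{\min} \approx \frac{a^2 c^2}{a^2 + b^2}, \qquad \lambda_{\max} \approx a^2 + b^2.
\]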
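As a quick numerical sanity check of the small-$c$ approximation for this 2x2 matrix, the following minimal sketch compares the exact roots of the characteristic equation with the approximate values. The function names and the test values of a, b, and c are hypothetical and chosen only for illustration; they do not come from the report.

```python
import math


def eig_exact(a, b, c):
    """Exact eigenvalues (min, max) of [[a^2, a*b], [a*b, b^2 + c^2]]."""
    trace = a * a + b * b + c * c
    det = a * a * c * c  # a^2 (b^2 + c^2) - (a b)^2
    root = math.sqrt(trace * trace - 4.0 * det)
    return (trace - root) / 2.0, (trace + root) / 2.0


def eig_approx(a, b, c):
    """Approximate eigenvalues (min, max), intended for c^2 << a^2 + b^2.

    lambda_min ~ a^2 c^2 / (a^2 + b^2), lambda_max ~ a^2 + b^2.
    """
    s = a * a + b * b
    return a * a * c * c / s, s


# Hypothetical values: c^2 = 0.01 is much smaller than a^2 + b^2 = 5.
a, b, c = 2.0, 1.0, 0.1
lo_exact, hi_exact = eig_exact(a, b, c)
lo_approx, hi_approx = eig_approx(a, b, c)
print(f"lambda_min: exact={lo_exact:.6e} approx={lo_approx:.6e}")
print(f"lambda_max: exact={hi_exact:.6e} approx={hi_approx:.6e}")
```

The product of the two exact eigenvalues equals the determinant $a^2 c^2$, so once $\lambda_{\max}$ is close to $a^2 + b^2$ the expression for $\lambda_{\min}$ follows directly.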




Similarly, for a quadratic of the form 

\[
a \lambda^2 - b \lambda + c = 0
\]

with $ac \ll b^2$, 

\[
\lambda_{\min} = \frac{b - \sqrt{b^2 - 4ac}}{2a} \approx \frac{c}{b}.
\]

The approximation follows from the first-order expansion $\sqrt{b^2 - 4ac} \approx b - 2ac/b$. 



To find the approximate minimum eigenvalue for the equiangular orthographic scanline camera, 
we substitute the values $C \approx \sum_j 1 = J_0$, $S \approx \theta^2 J_2$, $E \approx \theta J_2$, 
$C' \approx J_2$, and $S' \approx \theta^2 J_4$ into (37), 

\begin{align*}
0 &= C \lambda^2 - \left( SC + (S'C - E^2) X + C C' Z \right) \lambda + S (S'C - E^2) X + C (C'S - E^2) Z \\
  &\approx J_0 \lambda^2 - \left( J_0 J_2 (\theta^2 + Z) + \theta^2 (J_0 J_4 - J_2^2) X \right) \lambda + \theta^4 J_2 (J_0 J_4 - J_2^2) X + \theta^2 J_0 (J_2^2 - J_2^2) Z.
\end{align*}
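To illustrate how well the small-root approximation c/b can be expected to work on the substituted quadratic, here is a minimal sketch that evaluates its coefficients and compares the exact smaller root with c/b. All numerical values are hypothetical. The sketch also assumes that $J_k$ denotes the sum of $j^k$ over the frame indices, a definition taken as given here rather than restated from the report.

```python
import math


def scanline_quadratic_coeffs(J0, J2, J4, theta, X, Z):
    """Coefficients (a, b, c) of a*lam**2 - b*lam + c = 0 after substitution.

    The final term theta**2 * J0 * (J2**2 - J2**2) * Z is identically zero,
    so it does not contribute to c.
    """
    g = J0 * J4 - J2 ** 2
    a = J0
    b = J0 * J2 * (theta ** 2 + Z) + theta ** 2 * g * X
    c = theta ** 4 * J2 * g * X
    return a, b, c


def min_root_exact(a, b, c):
    """Smaller root of a*lam**2 - b*lam + c = 0 (assumes real roots, a > 0)."""
    return (b - math.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)


def min_root_approx(b, c):
    """First-order approximation c / b, intended for a*c << b**2."""
    return c / b


# Hypothetical example (not from the report): five frames indexed j = -2..2,
# with J_k assumed to be the sum of j**k over those frames.
frames = range(-2, 3)
J0 = sum(1 for _ in frames)
J2 = sum(j ** 2 for j in frames)
J4 = sum(j ** 4 for j in frames)
theta, X, Z = 0.05, 1.0, 1.0  # hypothetical angle step and shape sums

a, b, c = scanline_quadratic_coeffs(J0, J2, J4, theta, X, Z)
lam_exact = min_root_exact(a, b, c)
lam_approx = min_root_approx(b, c)
rel_err = abs(lam_exact - lam_approx) / lam_exact
print(f"exact={lam_exact:.6e} approx={lam_approx:.6e} rel_err={rel_err:.2e}")
```

For small rotation steps the constant term scales as $\theta^4$ while the linear coefficient stays of order $J_0 J_2 Z$, so the condition $ac \ll b^2$ holds comfortably in this example.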



Using the approximation in (53), we obtain 



